SINGULAR VALUE DECOMPOSITION AND REGRESSION


Covariance Matrix

In the basis of the singular vectors, the variance-covariance matrix in multiple linear regression becomes diagonal, and that diagonal is proportional to the inverse square of the singular value matrix. The variance-covariance matrix for the design matrix, denoted as $V$, is related to the diagonal singular value matrix, denoted as $S$, through the following equation:

$$V = \sigma^2 (X^T X)^{-1} = \sigma^2 U S^{-2} U^T,$$

where $\sigma^2$ is the variance of the errors in the regression model, $X$ is the design matrix, $U$ is the matrix of right singular vectors of $X$ (so that $X^T X = U S^2 U^T$), and $S$ is the diagonal matrix of singular values.


Singular Value Matrix 


The singular values in $S$ represent the square roots of the eigenvalues of the matrix $X^T X$. They provide information about the amount of variation in the data captured by each of the orthogonal components in the matrix $U$. The diagonal elements in $V$, on the other hand, represent the variance of each coefficient estimate in the regression model.

Thus, the relationship between the variance-covariance matrix for the design 
matrix and the diagonal singular value matrix is that they are related by the variance 
of the errors in the model and by the decomposition of the design matrix into its or- 
thogonal components, represented by the singular value matrix and the matrix of left 
singular vectors.
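The identity above can be checked numerically. The following is a minimal sketch, assuming a randomly generated design matrix and an arbitrary error variance `sigma2` (both hypothetical), and taking $U$ to hold the right singular vectors of $X$ as returned by NumPy's SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # hypothetical design matrix
sigma2 = 2.0                   # assumed error variance

# Thin SVD: X = W @ diag(s) @ Ut, where the rows of Ut are right singular vectors.
W, s, Ut = np.linalg.svd(X, full_matrices=False)
U = Ut.T

V_direct = sigma2 * np.linalg.inv(X.T @ X)          # sigma^2 (X^T X)^{-1}
V_svd = sigma2 * U @ np.diag(s ** -2.0) @ U.T       # sigma^2 U S^{-2} U^T

# Squared singular values versus eigenvalues of X^T X (sorted descending).
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

print(np.allclose(V_direct, V_svd), np.allclose(s ** 2, eigvals))
```

Small singular values inflate $S^{-2}$, which is one way to see why nearly collinear columns lead to large coefficient variances.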
MULTIPLE LINEAR
REGRESSION AND 


FOURIER 


Getting Rid of Noisy Signals 
From A Regression POV 


Signal distributions are of the form $y(t) = x(t) + \eta(t)$, where $y(t)$ is our observed outcome, $x(t)$ is the desired signal, and $\eta(t)$ is the Gaussian white noise.

Suppose we want to derive $x(t)$ from our observed $y(t)$. In that case, we can first compose our design matrix $X$, which has each column following a sinusoid with a different period. Then, we can estimate our coefficient matrix $\beta$, which has MLE equivalent to $(X^T X)^{-1} X^T y(t)$.

If $y(t)$ is highly correlated with a particular column in $X$, then the corresponding coefficient in $\beta$ for that column will also be high, and vice versa. That way, we can filter out the unwanted Gaussian noise using linear regression.
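As an illustration of the procedure above, here is a minimal sketch that assumes a made-up signal (a 3 Hz sine plus a 7 Hz cosine), unit-variance white noise, and a hypothetical grid of candidate frequencies from 1 to 10 Hz. None of these choices come from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 400)
x_true = 2.0 * np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)
y = x_true + rng.normal(scale=1.0, size=t.size)   # observed = signal + white noise

# Design matrix: one sine and one cosine column per candidate frequency.
freqs = np.arange(1, 11)
X = np.column_stack([np.sin(2 * np.pi * f * t) for f in freqs] +
                    [np.cos(2 * np.pi * f * t) for f in freqs])

# Least squares estimate of beta, then the filtered signal X @ beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
x_hat = X @ beta

mse_noisy = np.mean((y - x_true) ** 2)     # error of the raw observations
mse_fit = np.mean((x_hat - x_true) ** 2)   # error after regression filtering
strongest_sine = freqs[np.argmax(np.abs(beta[:freqs.size]))]
print(mse_noisy, mse_fit, strongest_sine)
```

The largest sine coefficient lands on the frequency that is actually present, and the fitted signal is much closer to the true one than the raw observations are.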


SERIES


Signal filtering and linear regression with a design matrix with each column being a sinusoid are related. They both involve working with signals that can be decomposed into different frequency components to extract useful information from signals. The focus of using linear regression is identifying and estimating each component’s contribution to the response variable, and finally removing the unwanted components.
LEAST SQUARES AND
THE MINIMIZATION OF
RECURSIVE RESIDUALS



Least Squares = Minimization of Recursive Residuals 

In multiple linear regression, the maximum likelihood estimate of $\beta$ can be found by locating the minimum point of the convex least squares equation, for example with gradient descent. Usually, we see the process as finding the minimum value of the least squares equation with regard to $\beta$. However, the process can also be seen as the minimization of recursive residuals.

Since the fitted least squares line can be seen as a projection of the observed $y$ values onto the hyperplane $\Gamma = X\beta$ containing all the fitted values $\hat{y}$, the errors are naturally the distance between the observed $y$ and $\hat{y}$ in $\Gamma$ and are orthogonal to every $\hat{y}$. Thus, the residual of a vector $Y$ with respect to a vector $X$ can be expressed as:

$$e(Y, X) = Y - \mathrm{Proj}_X Y = Y - \frac{\langle Y, X\rangle}{\langle X, X\rangle} X.$$

If we write our design matrix $X$ as a matrix of its columns $[X_1, X_2, \dots, X_n]$, where each $X_i$ is the $i$th column of $X$, then the least squares equation can be expressed as:

$$\|Y - X\beta\|^2 = \|Y - X_1\beta_1 - X_2\beta_2 - \dots - X_n\beta_n\|^2.$$

We can then choose a particular $X_i\beta_i$ and fix the rest of the equation $\|(Y - X_1\beta_1 - \dots - X_{i-1}\beta_{i-1} - X_{i+1}\beta_{i+1} - \dots - X_n\beta_n) - X_i\beta_i\|^2$ to find the MLE of $\beta_i$. In other words, we use $X_i$ to predict $(Y - X_1\beta_1 - \dots - X_{i-1}\beta_{i-1} - X_{i+1}\beta_{i+1} - \dots - X_n\beta_n)$. Without loss of generality, pick out the column $X_1$ and its corresponding coefficient $\beta_1$. The MLE of $\beta_1$ can be found using the formula

$$\hat\beta_1 = \frac{\langle Y - X_2\beta_2 - \dots - X_n\beta_n,\, X_1\rangle}{\langle X_1, X_1\rangle} = \frac{\langle Y, X_1\rangle}{\langle X_1, X_1\rangle} - \beta_2\frac{\langle X_2, X_1\rangle}{\langle X_1, X_1\rangle} - \dots - \beta_n\frac{\langle X_n, X_1\rangle}{\langle X_1, X_1\rangle},$$

since the MLE of any coefficient matrix $\beta$ is $(X^T X)^{-1} X^T Y$. Substituting the value of $\hat\beta_1$ into the least squares equation and applying the formula for residuals, we get a partially minimized least squares loss function which has the coefficient $\beta_1$ set to its optimized value. That is,

$$\|(Y - X_2\beta_2 - \dots - X_n\beta_n) - X_1\beta_1\|^2 \ge \|e(Y, X_1) - \beta_2 e(X_2, X_1) - \dots - \beta_n e(X_n, X_1)\|^2.$$
[Figure: the estimate $X\hat\beta = \hat{Y}$ lies in the plane spanned by $X$; the residual vector $Y - \hat{Y}$ is perpendicular to that plane.]


Continuing this pattern, we can further express the residuals in the equations as residuals of residuals, and so on. That is, we predict the value of $(e(Y, X_1) - \beta_2 e(X_2, X_1) - \dots - \beta_{i-1} e(X_{i-1}, X_1) - \beta_{i+1} e(X_{i+1}, X_1) - \dots - \beta_n e(X_n, X_1))$ using $e(X_i, X_1)$. Without loss of generality, let $i = 2$. So, we obtain:

$$\|e(Y, X_1) - \beta_2 e(X_2, X_1) - \dots - \beta_n e(X_n, X_1)\|^2 \ge \|e(e(Y, X_1), e(X_2, X_1)) - \beta_3 e(e(X_3, X_1), e(X_2, X_1)) - \dots - \beta_n e(e(X_n, X_1), e(X_2, X_1))\|^2,$$

thus also eliminating $\beta_2$. At last, we obtain an expression with only two residuals, given by $e(e(\dots), e(\dots)) - \beta_n\, e(e(\dots), e(\dots))$, with the first term relating ultimately to the residual of $Y$ with $X_1$, and the last term relating ultimately to the residual of $X_n$ to $X_1$. This way, we can calculate the optimal value of $\beta$ without explicitly inverting $X^T X$: we first find the optimized value of $\beta_n$ and then work backwards, as every other value of $\beta_i$ depends on all the values of $\beta_j$, for $i < j \le n$.
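The elimination-then-back-substitution idea can be sketched in code. This is an illustrative implementation under the assumption that the columns of $X$ are linearly independent; the helper names `residual` and `solve_by_recursive_residuals` are hypothetical and not taken from the article.

```python
import numpy as np

def residual(y, x):
    """e(y, x): what remains of y after projecting it onto x."""
    return y - (y @ x) / (x @ x) * x

def solve_by_recursive_residuals(X, Y):
    n = X.shape[1]
    cols = [X[:, j].astype(float) for j in range(n)]
    y = Y.astype(float)
    stages = []  # per stage: (pivot column, later columns, response) before elimination
    for i in range(n):
        pivot = cols[i]
        stages.append((pivot, cols[i + 1:], y))
        # Replace the response and all later columns by their residuals on the pivot.
        y = residual(y, pivot)
        cols = cols[:i + 1] + [residual(c, pivot) for c in cols[i + 1:]]
    beta = np.zeros(n)
    # The last coefficient depends on nothing else; earlier ones use the later ones.
    for i in reversed(range(n)):
        pivot, later, y_i = stages[i]
        target = y_i.copy()
        for k, c in enumerate(later):
            target = target - beta[i + 1 + k] * c
        beta[i] = (target @ pivot) / (pivot @ pivot)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=30)
beta_recursive = solve_by_recursive_residuals(X, Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_recursive, beta_lstsq))
```

Only scalar divisions by $\langle \text{pivot}, \text{pivot}\rangle$ are needed, which is the sense in which no matrix inverse is formed explicitly.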
ANCOVA & COVARIANCES IN REGRESSION


ANCOVA vs. Linear Regression


ANCOVA is a statistical method used to test for significant differences between means in two or more groups while adjusting for one or more continuous covariates. On the other hand, linear regression is a method used to model the relationship between a dependent variable and one or more independent variables.

ANCOVA and linear regression are related in that they both use the same basic framework of partitioning the total sum of squares into different components. ANCOVA can be thought of as a special case of linear regression where the independent variables include a categorical group label alongside the continuous covariates.


Categorical Label in Regression 


Many times we encounter cases in which there are underlying confounding variables that will affect the output of our regression model. For instance, in a case where we want to predict the score of students from the number of hours they spent reviewing, and there are three different, independent classes, then the class that a student attends is a possible confounding variable. We might obtain a graph that appears to be downward sloping, because students in certain classes are given lower scores in general but review for longer hours, even though within each group there is a positive correlation. Thus, it would be better to split our regression model into these multiple groups and analyze their variances and coefficients, respectively.

Let’s first consider the scenario with two different groups. Let $X$ be the original design matrix and $Y$ be the column vector that can be divided into two independent groups. Define $W$ as the matrix $[Z\ X]$, where

$$Z = \begin{bmatrix} J_{n_1} & 0_{n_1} \\ 0_{n_2} & J_{n_2} \end{bmatrix},$$

where $J_n$ is a column vector consisting of $n$ 1s, $0_n$ is a column vector consisting of $n$ 0s, and $n_1 + n_2 = n$, meaning the sum of the length of data in two groups is the full length of the dataset. Let $\beta$ be the common slope across both groups when we fit $Y$ against $X$. Define $\Gamma$ as the following matrix:

$$\Gamma = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \beta \end{bmatrix},$$

where $\mu_1$ is the intercept of group 1, and $\mu_2$ is the intercept of group 2.

We now minimize the new least squares expression $\|Y - W\Gamma\|^2 = \|Y - X\beta - Z\mu\|^2$, where $\mu$ is the column vector consisting of $\mu_1$ and $\mu_2$. If we fix the value of $(Y - X\beta)$ and use $Z$ to predict it, the maximum likelihood estimate for $\mu$ is the group-wise mean of $(Y - X\beta)$. Let $Y_1$ and $Y_2$ be the column vectors of the two groups in our label, and $X_1$ and $X_2$ be the matrix representation of two groups in our design matrix, which are of length $n_1$ and $n_2$, respectively. So, $\mu_1 = \bar{Y}_1 - \bar{X}_1\beta$ and $\mu_2 = \bar{Y}_2 - \bar{X}_2\beta$. If we substitute the above results back to our least squares equation, we get:

$$\|Y - W\Gamma\|^2 = \left\| \begin{bmatrix} Y_1 - \bar{Y}_1 J_{n_1} \\ Y_2 - \bar{Y}_2 J_{n_2} \end{bmatrix} - \begin{bmatrix} X_1 - J_{n_1}\bar{X}_1 \\ X_2 - J_{n_2}\bar{X}_2 \end{bmatrix} \beta \right\|^2,$$

from which we can derive the maximum likelihood estimation of $\beta$ with the design matrix and output vector both divided into two groups.


The MLE of $\beta$ follows

$$\hat\beta = \frac{\langle Y - \bar{Y},\, X - \bar{X}\rangle}{\langle X - \bar{X},\, X - \bar{X}\rangle} = \frac{\sum_{i=1}^{2}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)(x_{ij} - \bar{x}_i)}{\sum_{i=1}^{2}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}$$

for each column of $X$, where $\bar{Y}$ and $\bar{X}$ denote the group-wise means. We observe that the numerator of the expression is the sum of the numerators of the MLEs obtained when we fit $X_1$ with $Y_1$ and when we fit $X_2$ with $Y_2$. We call these coefficients $\beta_1$ and $\beta_2$, respectively. Hence, $\beta = P\beta_1 + (1-P)\beta_2$, where $P = \sum_j (x_{1j} - \bar{x}_1)^2 / \sum_i\sum_j (x_{ij} - \bar{x}_i)^2$, so the common slope can be seen as a weighted average of the slopes of the two groups. Thus, we have derived the MLE for $\beta$ that adjusts for the baseline intercept of each of the two groups.
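The weighted-average result can be verified numerically. The sketch below assumes two simulated groups with different intercepts and a shared true slope (all values hypothetical), and compares $P\beta_1 + (1-P)\beta_2$ with the slope from fitting $W = [Z\ X]$ directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 40, 60
x1 = rng.normal(5.0, 1.0, n1)
y1 = 10.0 + 1.5 * x1 + rng.normal(0.0, 1.0, n1)
x2 = rng.normal(8.0, 1.0, n2)
y2 = 2.0 + 1.5 * x2 + rng.normal(0.0, 1.0, n2)

def slope(x, y):
    """Within-group least squares slope."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

b1, b2 = slope(x1, y1), slope(x2, y2)
ss1 = np.sum((x1 - x1.mean()) ** 2)
ss2 = np.sum((x2 - x2.mean()) ** 2)
P = ss1 / (ss1 + ss2)
beta_weighted = P * b1 + (1 - P) * b2

# Fit the model with separate intercepts and a common slope: W = [Z X].
Z = np.zeros((n1 + n2, 2))
Z[:n1, 0] = 1.0
Z[n1:, 1] = 1.0
W = np.column_stack([Z, np.concatenate([x1, x2])])
gamma, *_ = np.linalg.lstsq(W, np.concatenate([y1, y2]), rcond=None)
mu1, mu2, beta_common = gamma
print(beta_weighted, beta_common)
```

The fitted intercepts also match $\mu_i = \bar{Y}_i - \bar{X}_i\beta$, as derived above.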


INTUITION OF THE CORRELATION COEFFICIENT


SSR, SSE, and SST 

The Sum of Squares Regression (SSR) is defined as the summation of the squares of the difference between the fitted $y$ values and the mean of observed $y$ values, which is a measure of the amount of variation in $y$ that is explained by the linear model, which follows $\sum(\hat{y} - \bar{y})^2$.

The Sum of Squared Errors/Residuals (SSE) is the summation of all the squared residuals, which follows $\sum(y - \hat{y})^2$.

The Total Sum of Squares (SST) is defined as a statistical measure of deviation from the mean, which follows $\sum(y - \bar{y})^2$.


Proof of “SST = SSE + SSR” & Partition of Variances 
We can write the design matrix $X$ as $[J_n, X']$, with $J_n$ being a column matrix composed of $n$ 1s and $X'$ the rest of the $X$ matrix. Write $H_X = X(X^T X)^{-1}X^T$ and $H_J = J_n(J_n^T J_n)^{-1}J_n^T$ for the corresponding hat matrices.
The SSR of the model can be expressed as:
$\|\hat{Y} - \bar{Y}\|^2$
$= \|H_X Y - H_J Y\|^2$
$= \|(H_J - H_X) Y\|^2$
$= Y^T (H_J - H_X)^T (H_J - H_X) Y$
$= Y^T (H_J - H_X)(H_J - H_X) Y$
$= Y^T (H_J^T H_J + H_X^T H_X - H_J^T H_X - H_X^T H_J) Y$
$= Y^T (H_J + H_X - 2 H_J H_X) Y,$
as the hat matrices have the property that $H^T H = HH = H$. The residuals are orthogonal to any vector that is a linear combination of the columns of the design matrix $X$: for any such vector $X\beta$, $e^T X\beta = Y^T (I - H_X) X\beta = Y^T (X - H_X X)\beta = Y^T (X - X)\beta = 0$, as $H_X X = X$. Since $J_n$ is itself a column of $X$, it lies on some $X\beta$, and we obtain that $e^T J_n = Y^T (I - H_X) J_n = 0$. Because this holds for every $Y$, it further implies $(I - H_X) J_n = 0$. We then obtain:
$J_n = H_X J_n$
$\Rightarrow J_n (J_n^T J_n)^{-1} J_n^T = H_X J_n (J_n^T J_n)^{-1} J_n^T$
$\Rightarrow H_J = H_X H_J$
$\Rightarrow H_J = H_J H_X.$
Hence, the expression for SSR $Y^T (H_J + H_X - 2 H_J H_X) Y$ further evaluates to
$Y^T (H_J H_X + H_X - 2 H_J H_X) Y = Y^T (H_X - H_J H_X) Y = Y^T (H_X - H_J) Y.$
The SSE can be expressed as:
$\|Y - \hat{Y}\|^2$
$= \|Y - X(X^T X)^{-1} X^T Y\|^2$
$= \|(I - X(X^T X)^{-1} X^T) Y\|^2$
$= Y^T (I - H_X) Y.$
The SST can be expressed as:
$\|Y - \bar{Y}\|^2$
$= \|Y - J_n (J_n^T J_n)^{-1} J_n^T Y\|^2$
$= \|Y - H_J Y\|^2$
$= \|(I - H_J) Y\|^2$
$= Y^T (I - H_J)^T (I - H_J) Y$
$= Y^T (I - H_J) Y,$
as $(I - H_J)$ is idempotent. The SST can further be expressed as the summation of SSR and SSE:

$\|Y - \bar{Y}\|^2$
$= Y^T (I - H_J) Y$
$= Y^T (I - H_X + H_X - H_J) Y$
$= Y^T (I - H_X) Y + Y^T (H_X - H_J) Y$
$= SSE + SSR.$
Therefore, we have completed the proof that SST = SSE + SSR.

From the result above, we can partition the variance of our linear model into the variance of the residuals and the variance of the regression model. The square of the correlation coefficient, $r^2$, is defined as the quotient of SSR and SST (SSR/SST). It can be interpreted as the proportion of total variability explained by the linear association with the added regressors.

Intuitively, it also follows that the larger the SSR, the more variability in the set of data can be attributed to the model, and thus the greater the value of $r^2$.
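The partition can be checked with explicit hat matrices. This is a minimal sketch on simulated data (the coefficients and sample size are arbitrary assumptions), using the quadratic forms $Y^T(H_X - H_J)Y$, $Y^T(I - H_X)Y$, and $Y^T(I - H_J)Y$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column is J_n
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def hat(M):
    """Projection (hat) matrix onto the column space of M."""
    return M @ np.linalg.inv(M.T @ M) @ M.T

H_X = hat(X)
H_J = hat(np.ones((n, 1)))
I = np.eye(n)

SSR = Y @ (H_X - H_J) @ Y
SSE = Y @ (I - H_X) @ Y
SST = Y @ (I - H_J) @ Y
r_squared = SSR / SST
print(SST, SSE + SSR, r_squared)
```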


[Figure: the total deviation $Y_i - \bar{Y}$ of a data point and its error component $Y_i - \hat{Y}_i$.]


BIZZARE MATHEMATICS 


DISTRIBUTION OF 
THE PREDICTED Y 


[Figure: conditional distributions of the response $y$ at different values of the input $x$.]


Chi-Squares In Quadratic Forms 

Suppose $X \sim N(\mu, \Sigma)$, where $\Sigma$ is the variance-covariance matrix. Then, $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$, a standard normal distribution. The sum of squares of the i.i.d. entries of $\Sigma^{-1/2}(X - \mu)$ follows a chi-squared distribution; that is, $(X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2$ with $n$ degrees of freedom.


Interested in learning more about this magazine?

Contact the author at sophia.yx.zhu@gmail.com.


Random variables following a normal/Gaussian distribution can be expressed as $\frac{X - \mu}{\sigma}$, where $\sigma$ is the population standard deviation.

Random variables following a Student’s t-distribution can be expressed as $\frac{X - \mu}{S}$, where $S$ is the sample standard deviation.

Random variables following a chi-squared distribution can be expressed as $\frac{(X - \mu)^2}{\sigma^2}$.

A t-distribution can be seen as a normal distribution over the square root of a $\chi^2$ distribution divided by its degrees of freedom.
In fact, if we have a matrix $A$ such that $A\Sigma$ is idempotent, then $(X - \mu)^T A (X - \mu)$ also follows a chi-squared distribution, as proved by the following.

Let $p$ be the rank of $A$ and $n$ be the rank of $\Sigma$ with $p \le n$. Since $A\Sigma$ is idempotent, we have $A\Sigma A\Sigma = A\Sigma$, which further implies $A\Sigma A = A$.

If we write out the eigenvalue decomposed version of $A = VDV^T$, where $V$ is of size $n \times p$ and $D$ is of size $p \times p$, we can rewrite $A\Sigma A$ as $VDV^T \Sigma VDV^T = A = VDV^T$. This equation further implies:

$V^T V D V^T \Sigma V D V^T V = V^T V D V^T V.$

By the property that $V^T V = I$ (as the columns of $V$ are orthonormal eigenvectors), we can simplify the above equation:

$D V^T \Sigma V D = D$
$D^{1/2} V^T \Sigma V D^{1/2} = I.$

We then create the random variable defined as $D^{1/2} V^T (X - \mu)$, which follows $N(0, D^{1/2} V^T \Sigma V D^{1/2}) = N(0, I)$. The squared norm of this vector of i.i.d. standard normals is $(X - \mu)^T V D^{1/2} D^{1/2} V^T (X - \mu) = (X - \mu)^T A (X - \mu) \sim \chi^2$ with $p$ degrees of freedom.


Variance of Residuals & 


Chi-Squares 
In this section, we will show that the dot product of the residuals $e$ with itself, divided by the population variance, follows a chi-squared distribution using the previous result.

$$\frac{e^T e}{\sigma^2} = \frac{Y^T (I - H_X) Y}{\sigma^2} = \frac{(Y - X\beta)^T (I - H_X)(Y - X\beta)}{\sigma^2}.$$

The second part of the equation is valid as $(I - H_X)X = 0$, so the additional $X\beta$ term does not add anything to the previous equation. The expression is of the form $(Y - X\beta)^T A (Y - X\beta)$ where $A = (I - H_X)/\sigma^2$ and $X\beta$ is the mean of $Y$. Check that $A\Sigma = (I - H_X)/\sigma^2 \times (\sigma^2 I) = (I - H_X)$, an idempotent matrix. Thus, $\frac{e^T e}{\sigma^2} \sim \chi^2$. Another way to write this expression is $\frac{(n-p)S^2}{\sigma^2}$, where $S^2 = e^T e/(n-p)$ is the sample variance of the residuals. Its degrees of freedom is rank$(I - H_X) = n - p$ for a design matrix $X$ with a rank of $p$.
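A quick simulation can make the degrees of freedom concrete. The sketch below assumes a fixed random design with $n = 25$, $p = 3$, and $\sigma = 1.5$ (all hypothetical) and checks that $e^T e/\sigma^2$ has mean close to $n - p$ and variance close to $2(n - p)$, as a $\chi^2_{n-p}$ variable should.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 25, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T    # I - H_X, symmetric

# One simulated response vector per row.
Y = X @ beta + sigma * rng.normal(size=(20000, n))
E = Y @ M                                           # residual vector of each replication
q = np.sum(E ** 2, axis=1) / sigma ** 2             # e^T e / sigma^2

print(q.mean(), q.var(), np.trace(M))
```

The trace of $I - H_X$ equals its rank, which is another way to read off the $n - p$ degrees of freedom.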


T-Distributed Coefficients 


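Let $q$ be a column vector that is composed of all zeroes except for certain positions that we want to observe in $\hat\beta$, and let $\hat\beta$ be the estimated coefficient matrix. For instance, if we want to only pick out the $i$th entry in $\hat\beta$, then we will only assign the $i$th position in $q$ to be 1. The covariance of $q^T\hat\beta$ and the residuals $e$ is

$$\mathrm{cov}(q^T\hat\beta, e) = \mathrm{cov}(q^T (X^T X)^{-1} X^T Y,\ (I - H_X) Y) = q^T (X^T X)^{-1} X^T\, \mathrm{cov}(Y, Y)\, (I - H_X)^T = q^T (X^T X)^{-1} X^T (I - H_X)\, \sigma^2.$$

Since $X^T (I - H_X) = 0$, the residuals are uncorrelated with $q^T\hat\beta$ and, because both are jointly normal, independent of it.

We know that $q^T\hat\beta \sim N(q^T\beta,\ q^T (X^T X)^{-1} q\, \sigma^2)$, since we assume normality in our residuals and thus normality in our response variable $Y$, as well as the fact that the least squares equation is based on the MLE of normal curves. Hence

$$\frac{q^T\hat\beta - q^T\beta}{\sigma\sqrt{q^T (X^T X)^{-1} q}} \sim N(0, 1).$$

If we divide the standard normal above by the square root of the chi-squared distribution of $\frac{(n-p)S^2}{\sigma^2}$ divided by the degrees of freedom, we get a t-distribution as the following:

$$\frac{\dfrac{q^T\hat\beta - q^T\beta}{\sigma\sqrt{q^T (X^T X)^{-1} q}}}{\sqrt{\dfrac{(n-p)S^2}{\sigma^2} \Big/ (n-p)}} \sim \frac{N(0, 1)}{\sqrt{\chi^2_{n-p}/(n-p)}}, \qquad \text{that is,} \qquad \frac{q^T\hat\beta - q^T\beta}{S\sqrt{q^T (X^T X)^{-1} q}} \sim t_{n-p},$$

showing that each coefficient in $\hat\beta$ follows a t-distribution if we create an interval using the sample standard deviation.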
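The final statistic is straightforward to compute. Below is a minimal sketch; the function name `t_statistic` and its `value` argument (the hypothesized $q^T\beta$) are illustrative choices, not notation from the article.

```python
import numpy as np

def t_statistic(X, Y, q, value=0.0):
    """(q^T beta_hat - value) / (S * sqrt(q^T (X^T X)^{-1} q))."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    S2 = resid @ resid / (n - p)          # sample variance of the residuals
    return (q @ beta_hat - value) / np.sqrt(S2 * (q @ XtX_inv @ q))

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 0.3 * x + rng.normal(size=n)

q = np.array([0.0, 1.0])                  # pick out the slope
t_slope = t_statistic(X, Y, q)

# Textbook simple-regression formula for the same statistic.
Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (Y - Y.mean())) / Sxx
a = Y.mean() - b * x.mean()
S2 = np.sum((Y - a - b * x) ** 2) / (n - 2)
t_textbook = b / np.sqrt(S2 / Sxx)
print(t_slope, t_textbook)
```

Under the null hypothesis the value would be compared against a $t_{n-p}$ distribution, here with $n - p = 38$ degrees of freedom.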
DERIVATION OF THE


F-STATISTIC 


Why F-Test? 


If we want to test whether a column in $X$ is significantly correlated with $Y$, we consider using the F-statistic defined as:

$$F = \frac{(K\hat\beta - K\beta)^T (K (X^T X)^{-1} K^T)^{-1} (K\hat\beta - K\beta)}{\mathrm{rank}(K)\, S^2},$$

where $K$ is a full row rank contrast matrix that tests whether a linear combination of certain columns in $X$ is a significant predictor of $Y$ (or whether the coefficients for a group of predictor variables are jointly equal to zero).

The F-statistic is calculated by comparing the variance of the residuals in the reduced model (i.e., the model without the predictor(s) associated with the null hypothesis) to the variance of the residuals in the full model (i.e., the model with all predictors). The F-statistic is the quotient of two chi-squared distributed quantities, with the numerator being the distribution of the variance of the reduced model and the denominator being the distribution of the variance of the full model, each divided by the degrees of freedom associated with it. According to the formula above, the F-statistic can also be seen as a squared, t-like standardized quantity divided by rank($K$), the degrees of freedom of the reduced model; when rank($K$) = 1 it is exactly the square of a t-statistic.


AND TEST 


What F-Test? 


The intuition behind F is the ratio of the variances of two models. The test follows $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$, or $K\beta = 0$, against $H_a$: at least one $\beta_i \neq 0$, or $K\beta \neq 0$. If our model is a reduced model with $p$ predictors, we want to determine if the reduced model, a subset of the predictor variables in a full model, significantly contributes to the prediction of the response variable compared to the full model. The standard F-statistic is:

$$F = \frac{MSR}{MSE},$$


in which the numerator is the SSR divided by the degrees of freedom of the SSR of the reduced model ($p$) and the denominator is the SSE divided by the degrees of freedom of the SSE of the reduced model ($n-p$), in which $n$ is the total number of observations. Both the numerator and the denominator are chi-squared due to the residual normality assumption divided by their degrees of freedom, a scaling factor that makes the quotient justifiable. Since MSR measures the variation in the response variable explained by the regression model, and MSE measures the unexplained variation in the response variable, the statistic is a signal-to-noise ratio.

The SSR of the reduced model is:

$$(K\hat\beta - K\beta)^T (K (X^T X)^{-1} K^T \sigma^2)^{-1} (K\hat\beta - K\beta),$$

which follows a chi-squared distribution with rank($K$) = $p$ degrees of freedom. The numerator of the F-statistic, or the mean squares of regression, is:

$$\frac{(K\hat\beta - K\beta)^T (K (X^T X)^{-1} K^T \sigma^2)^{-1} (K\hat\beta - K\beta)}{\mathrm{rank}(K)}.$$

The denominator, on the other hand, is the sum of squared residuals divided by its degrees of freedom ($n-p$). It is the distribution $\frac{(n-p)S^2}{\sigma^2}$ further divided by $(n-p) = (n - \mathrm{rank}(K))$. As a result, the F-statistic for testing whether $K\beta = 0$ simplifies to:

$$F = \frac{(K\hat\beta - K\beta)^T (K (X^T X)^{-1} K^T)^{-1} (K\hat\beta - K\beta)}{\mathrm{rank}(K)\, S^2}.$$
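To connect the contrast form with the model-comparison intuition, the sketch below computes the F-statistic for $K\beta = 0$ and compares it with the familiar ratio built from the residual sums of squares of the full and reduced models. The data, the choice of $K$, and the function name `f_statistic` are assumptions made for illustration.

```python
import numpy as np

def f_statistic(X, Y, K, m=None):
    """F = (K b - m)^T (K (X^T X)^{-1} K^T)^{-1} (K b - m) / (rank(K) S^2)."""
    n, p = X.shape
    m = np.zeros(K.shape[0]) if m is None else m
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    S2 = resid @ resid / (n - p)
    d = K @ beta_hat - m
    return (d @ np.linalg.inv(K @ XtX_inv @ K.T) @ d) / (np.linalg.matrix_rank(K) * S2)

def sse(X, Y):
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta_hat
    return r @ r

rng = np.random.default_rng(7)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

# Test whether the last two coefficients are jointly zero.
K = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
F_contrast = f_statistic(X, Y, K)

# Same test via the extra sum of squares of the reduced model.
sse_full, sse_reduced = sse(X, Y), sse(X[:, :2], Y)
F_compare = ((sse_reduced - sse_full) / 2) / (sse_full / (n - X.shape[1]))
print(F_contrast, F_compare)
```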


[Figure: F-distribution densities for several pairs of degrees of freedom (d1, d2).]


Try This Question:

Which distribution does the expression $0.5\,[(K\hat\beta - m)^T (K (X^T X)^{-1} K^T S^2)^{-1} (K\hat\beta - m)]$ follow if we are testing against the null hypothesis that $K\beta = m$?

1. A $\chi^2$ distribution with $n - \mathrm{rank}(K)$ degrees of freedom.

2. The square of a t-distribution with $n - \mathrm{rank}(K)$ degrees of freedom divided by 2.

3. An F-distribution with 2 numerator degrees of freedom and $n - \mathrm{rank}(K)$ denominator degrees of freedom.


Contact the author to learn more.
PREDICTION
INTERVALS 


Confidence vs. Prediction Intervals 


A prediction interval is a statistical measure that estimates an interval in which future observations are likely to fall based on a given set of predictors. It is similar to a confidence interval, but whereas a confidence interval estimates the range of values that contains a population parameter with a certain level of confidence, a prediction interval estimates the range of values that contains an individual observation with a certain level of confidence. The 95% confidence interval of $\hat{y}_0$ is $\hat{y}_0 \pm t_{95\%,\, df=n-p} \cdot S\,(x_0^T (X^T X)^{-1} x_0)^{1/2}$, where the variance is $\mathrm{Var}(\hat{y}_0)$. The variance used for a prediction interval is derived as follows:

$\mathrm{Var}(e)$
$= \mathrm{Var}(y_0 - \hat{y}_0)$
$= \mathrm{Var}(y_0 - x_0^T\hat\beta)$
$= \sigma^2 + (x_0^T (X^T X)^{-1} x_0)\sigma^2$
$= (1 + x_0^T (X^T X)^{-1} x_0)\sigma^2.$

Thus, the 95% prediction interval is $\hat{y}_0 \pm t_{95\%,\, df=n-p} \cdot S\,(1 + x_0^T (X^T X)^{-1} x_0)^{1/2}$. Here $\mathrm{Var}(e)$ is used because the prediction error has two parts: the uncertainty of the fitted mean, $\mathrm{Var}(\hat{y}_0) = (x_0^T (X^T X)^{-1} x_0)\sigma^2$, and the unexplained error variance $\sigma^2$, which captures the variation of the individual response around the predicted response.

Prediction intervals don’t converge to a single point estimate as the sample grows, unlike confidence intervals. In other words, a confidence interval accounts for the uncertainty in estimating a population parameter from the sample, while a prediction interval accounts for the additional uncertainty in predicting the value of a future observation from the population. This additional uncertainty arises from the inherent variability of the response variable, which cannot be fully captured by the sample data alone.
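The two interval widths differ only by the extra "1 +" under the square root. The following sketch makes that explicit; the function name `interval_half_widths` is hypothetical, and the critical value `t_crit` is assumed to be supplied by the caller (for instance from a t-table) rather than computed here.

```python
import numpy as np

def interval_half_widths(X, Y, x0, t_crit):
    """Return (y0_hat, confidence half-width, prediction half-width) at x0."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    S = np.sqrt(resid @ resid / (n - p))       # sample standard deviation
    leverage = x0 @ XtX_inv @ x0               # x0^T (X^T X)^{-1} x0
    ci = t_crit * S * np.sqrt(leverage)        # uncertainty of the fitted mean
    pi = t_crit * S * np.sqrt(1.0 + leverage)  # plus the individual error variance
    return x0 @ beta_hat, ci, pi

rng = np.random.default_rng(8)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)
x0 = np.array([1.0, 0.5])

# 2.01 is roughly the two-sided 95% t critical value for 48 degrees of freedom.
y0_hat, ci_half, pi_half = interval_half_widths(X, Y, x0, t_crit=2.01)
print(y0_hat, ci_half, pi_half)
```

As $n$ grows the leverage term shrinks toward zero, so the confidence half-width vanishes while the prediction half-width approaches $t \cdot S$.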


LEAVE-ONE- 
OUT RESIDUALS 


Outliers & Leave-One-Out Residuals 


Usually, we want to eliminate the noise (data points that are contaminated or provide false influence on our linear models). These outliers are usually found using the 1.5 × IQR rule or the within 2 standard deviation rule (within 95% of the spread assuming normality).


Leave-One-Out (LOO) residuals are a type of cross-validation method used in regression analysis to evaluate the performance of a model. The LOO method involves removing a single observation from the dataset, fitting the model on the remaining $n-1$ observations, and using the model to predict the response variable for the left-out observation. LOO residuals can be used to identify potential outliers or influential observations. An observation with a large LOO residual indicates that the model fit is highly sensitive to that observation, and its removal could result in a substantially different model. Such an observation is considered influential and could be an outlier. A more common usage of LOO is to assess the performance of a model by calculating metrics such as the Root Mean Squared Error (RMSE) or the Mean Absolute Error (MAE) based on the $n$ LOO residuals. These metrics provide a measure of how well the model is able to predict the response variable for new observations not included in the original dataset.

[Figure: the same scatter plot fitted with outliers and with outliers removed; removing the outliers gives a much better fit.]

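The LOO residual is calculated as $y_i - \hat{y}_i^{(-i)}$, where $\hat{y}_i^{(-i)}$ is the predicted value for observation $i$ from the model fitted on the remaining $n-1$ observations. Let $\delta_i$ be a column vector with all 0s except for a 1 in the $i$th position. Let $W = [X\ \delta_i]$, and

$$\Gamma = \begin{bmatrix} \beta \\ \lambda_i \end{bmatrix},$$

where $\lambda_i$ is the coefficient appended to the coefficient matrix $\beta$ for the additional $\delta_i$ column in $W$. Then, the nuance behind LOO in matrix form is presented below.

The least squares equation now becomes

$$\|Y - W\Gamma\|^2 = \sum_j \Big(y_j - \sum_k x_{j,k}\beta_k - \delta_{ij}\lambda_i\Big)^2 = \sum_{j \neq i} \Big(y_j - \sum_k x_{j,k}\beta_k\Big)^2 + \Big(y_i - \sum_k x_{i,k}\beta_k - \lambda_i\Big)^2.$$

By separating the loss function into two parts, we can minimize the two parts respectively. The MLE of $\lambda_i$ is $y_i - \sum_k x_{i,k}\beta_k$ if we let $(y_i - \sum_k x_{i,k}\beta_k - \lambda_i)^2 = 0$. So, we get

$$\sum_{j \neq i} \Big(y_j - \sum_k x_{j,k}\beta_k\Big)^2 + \Big(y_i - \sum_k x_{i,k}\beta_k - \lambda_i\Big)^2 \ge \sum_{j \neq i} \Big(y_j - \sum_k x_{j,k}\beta_k\Big)^2 = \|Y^{(-i)} - X^{(-i)}\beta\|^2.$$

The MLE of $\beta$ in the new least squares equation is $\hat\beta^{(-i)}$. Thus, $\|Y^{(-i)} - X^{(-i)}\beta\|^2 \ge \|Y^{(-i)} - X^{(-i)}\hat\beta^{(-i)}\|^2$. Returning to the MLE of $\lambda_i$, we have:

$$\hat\lambda_i = y_i - \sum_k x_{i,k}\hat\beta_k^{(-i)} = y_i - \hat{y}_i^{(-i)}.$$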
This result shows that:
1. Adding a regressor that is all 0 but a 1 in the $i$th data point is equivalent to deleting that data point from analysis. So, if the $i$th data point is a potential outlier or noise, we can simply append another column $\delta_i$ into our design matrix $X$.
2. The coefficient for $\delta_i$ is equivalent to the LOO residual of the $i$th data point. The LOO residual can also be seen as the magnitude of how different the observed data point is from the fitted model without the $i$th data point, or how much it deviates from the average trend of the dataset.
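Both conclusions can be confirmed on a small simulated example. The sketch below assumes a simple two-column design with one planted outlier (all values hypothetical) and compares the fit with the appended indicator column against an explicit refit without that observation.

```python
import numpy as np

rng = np.random.default_rng(9)
n, i = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 2.0]) + rng.normal(size=n)
Y[i] += 8.0                                   # plant an outlier at position i

# Fit with the extra indicator column delta_i appended: W = [X delta_i].
delta = np.zeros(n)
delta[i] = 1.0
W = np.column_stack([X, delta])
gamma, *_ = np.linalg.lstsq(W, Y, rcond=None)
beta_with_dummy, lam = gamma[:-1], gamma[-1]

# Refit after actually deleting observation i.
keep = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
loo_residual = Y[i] - X[i] @ beta_loo

print(lam, loo_residual)
```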


PRESS Residuals & 
Error Distribution 


PRESS (Predicted Residual Sum of Squares) residuals are a measure of the predictive power of a linear model. They are computed by omitting one observation at a time from the data set, and then calculating the predicted value for the omitted observation. The PRESS residual is the difference between the actual value of the omitted observation and its predicted value based on the reduced model.

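Let’s start by rewriting the design matrix $X$ as rows:

$$X = \begin{bmatrix} z_1^T \\ \vdots \\ z_n^T \end{bmatrix}.$$

The quantity $X^T X = \sum_j z_j z_j^T$, while $X^{(-i)T} X^{(-i)} = \sum_{j \neq i} z_j z_j^T$. Therefore, $X^{(-i)T} X^{(-i)} = X^T X - z_i z_i^T$. The inverse of $X^{(-i)T} X^{(-i)}$ can be found using the Sherman-Morrison-Woodbury Theorem as:

$$(X^{(-i)T} X^{(-i)})^{-1} = (X^T X)^{-1} + \frac{(X^T X)^{-1} z_i z_i^T (X^T X)^{-1}}{1 - z_i^T (X^T X)^{-1} z_i}.$$

Since $H_X = X (X^T X)^{-1} X^T$, the $i$th diagonal slot $H_{X,ii} = \delta_i^T X (X^T X)^{-1} X^T \delta_i = z_i^T (X^T X)^{-1} z_i$. Substitute the value $H_{X,ii}$ into the expression for $(X^{(-i)T} X^{(-i)})^{-1}$ to get:

$$(X^{(-i)T} X^{(-i)})^{-1} = (X^T X)^{-1} + \frac{(X^T X)^{-1} z_i z_i^T (X^T X)^{-1}}{1 - H_{X,ii}}.$$

The quantity $X^{(-i)T} Y^{(-i)} = X^T Y - z_i y_i$ by the same logic used to deduce $X^{(-i)T} X^{(-i)}$. We also know that $\hat{y}_i^{(-i)} = z_i^T (X^{(-i)T} X^{(-i)})^{-1} X^{(-i)T} Y^{(-i)}$, with $z_i^T$ being the $i$th row in $X$. As we know the value of $(X^{(-i)T} X^{(-i)})^{-1}$ and $X^{(-i)T} Y^{(-i)}$, $\hat{y}_i^{(-i)}$ can be written as:

$$\hat{y}_i^{(-i)} = z_i^T \left[(X^T X)^{-1} + \frac{(X^T X)^{-1} z_i z_i^T (X^T X)^{-1}}{1 - H_{X,ii}}\right](X^T Y - z_i y_i) = \hat{y}_i - H_{X,ii}\, y_i + \frac{H_{X,ii}\, \hat{y}_i}{1 - H_{X,ii}} - \frac{H_{X,ii}^2\, y_i}{1 - H_{X,ii}},$$

which after expansion simplifies to $\frac{\hat{y}_i - H_{X,ii}\, y_i}{1 - H_{X,ii}}$. This implies $y_i - \hat{y}_i^{(-i)} = \frac{y_i - \hat{y}_i}{1 - H_{X,ii}}$, so LOO residuals can be computed without refitting the model. Each one is the $i$th ordinary residual rescaled by $1/(1 - H_{X,ii})$, and therefore has variance $\sigma^2/(1 - H_{X,ii})$.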
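The shortcut $y_i - \hat{y}_i^{(-i)} = (y_i - \hat{y}_i)/(1 - H_{X,ii})$ can be compared against brute-force refitting. This is a minimal sketch on simulated data; the function name `press_residuals` is an illustrative choice.

```python
import numpy as np

def press_residuals(X, Y):
    """All n leave-one-out residuals from a single fit: e_i / (1 - H_ii)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = Y - H @ Y
    return e / (1.0 - np.diag(H))

rng = np.random.default_rng(10)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

shortcut = press_residuals(X, Y)

# Brute force: refit n times, each time leaving one observation out.
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    brute[i] = Y[i] - X[i] @ beta_i

press = np.sum(shortcut ** 2)   # the PRESS statistic itself
print(np.allclose(shortcut, brute), press)
```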
CONFI
ELLIP 


Ellipsoid In Vector Form 


Any arbitrarily oriented ellipsoid can be written as $(X - v)^T A (X - v) = 1$ for a positive definite matrix $A$, with the center at $v$.


Relation With F 


Continuing our discussion on F-tests, if we want to test $H_0: K\beta = 0$ against $H_a: K\beta \neq 0$, it is equivalent to deciding whether our F-statistic

$$\frac{(K\hat\beta - K\beta)^T (K (X^T X)^{-1} K^T)^{-1} (K\hat\beta - K\beta)}{\mathrm{rank}(K)\, S^2}$$

is within a certain range of the F-distribution (e.g., 95% confidence interval). Let $K\beta = m$, and rank($K$) = $v$, the F-statistic becomes

$$\frac{(K\hat\beta - m)^T (K (X^T X)^{-1} K^T)^{-1} (K\hat\beta - m)}{v S^2}.$$

Bounding the statistic with the critical F values forms an ellipsoid:

$$\frac{(K\hat\beta - m)^T (K (X^T X)^{-1} K^T)^{-1} (K\hat\beta - m)}{v S^2} \le F_{95\%},$$

which is equivalent to

$$(m - K\hat\beta)^T\, \frac{(K (X^T X)^{-1} K^T)^{-1}}{v S^2 F_{95\%}}\, (m - K\hat\beta) \le 1.$$

With $A$ being $\frac{(K (X^T X)^{-1} K^T)^{-1}}{v S^2 F_{95\%}}$, the formula forms the region bounded inside an ellipsoid. Thus, we obtain a “confidence ellipsoid,” in which are the values of $m$ for which we fail to reject our null hypothesis. In other words, $m$ is in a certain range around the center $K\hat\beta$ such that there is not enough evidence to reject $H_0$.
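Membership in the confidence ellipsoid reduces to evaluating one quadratic form. The sketch below is illustrative only: the function name `ellipsoid_value` is hypothetical, and the critical value `f_crit` is assumed to be supplied by the caller (for example from an F-table) rather than computed.

```python
import numpy as np

def ellipsoid_value(X, Y, K, m, f_crit):
    """Evaluate (m - K b)^T A (m - K b); m is inside the ellipsoid if the value <= 1."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    S2 = resid @ resid / (n - p)
    v = np.linalg.matrix_rank(K)
    A = np.linalg.inv(K @ XtX_inv @ K.T) / (v * S2 * f_crit)
    d = m - K @ beta_hat
    return d @ A @ d

rng = np.random.default_rng(11)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
K = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])            # joint region for the two slopes

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
center = K @ beta_hat
f_crit = 3.2                               # illustrative critical value, assumed given

at_center = ellipsoid_value(X, Y, K, center, f_crit)
far_away = ellipsoid_value(X, Y, K, center + 100.0, f_crit)
print(at_center <= 1.0, far_away <= 1.0)
```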


Summary 


A confidence ellipsoid is a geometric representation of the confidence intervals of multiple regression coefficients. It is an ellipsoid whose center corresponds to the estimated values of the coefficients, and whose shape and size are determined by the standard errors of the estimates and the level of confidence desired. The ellipsoid represents the set of values that the coefficients can take with a given level of confidence, and can be used to test hypotheses and make predictions about the values of the coefficients. A smaller ellipsoid indicates higher precision of the estimates, while a larger ellipsoid indicates greater uncertainty.
MONTE CARLO SIMULATION & PROBABILITY


The Jackknife 

The jackknife is a statistical resampling technique used to estimate the distribution of a statistic. It involves repeatedly leaving one observation out of the sample, computing the statistic of interest for each subsample, and then combining the results to estimate the population statistic. The statistic of interest can be calculated from these resampled datasets, and the distribution of the statistic across the resampled datasets is used to estimate its sampling distribution.
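The leave-one-out procedure described above (the jackknife) can be sketched in a few lines. The function name `jackknife_se` is hypothetical, and the sample is simulated. For the sample mean, the jackknife standard error coincides with the usual $s/\sqrt{n}$, which gives a convenient check.

```python
import numpy as np

def jackknife_se(data, statistic):
    """Jackknife standard error of `statistic` from leave-one-out replicates."""
    n = len(data)
    reps = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))

rng = np.random.default_rng(12)
sample = rng.exponential(scale=2.0, size=40)

se_mean_jackknife = jackknife_se(sample, np.mean)
se_mean_classic = sample.std(ddof=1) / np.sqrt(sample.size)
se_median_jackknife = jackknife_se(sample, np.median)   # no simple closed form
print(se_mean_jackknife, se_mean_classic, se_median_jackknife)
```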


Monte Carlo Simulations 

Monte Carlo simulation is a computational technique that uses random sampling to simulate complex systems or processes. The method is named after the famous Monte Carlo Casino, which is known for its games of chance and randomness. The jackknife is closely related: like a Monte Carlo method, it studies a statistic by recomputing it over many resampled datasets, although its leave-one-out subsamples are chosen systematically rather than at random. Monte Carlo is useful in situations where the underlying population distribution is unknown or difficult to model, or where the sample size is small and traditional methods of inference may not be reliable.


Try These Questions:

1. Estimate the value of $\pi$ using the normal approximation of the binomial distribution by sampling the proportion of points that lie within a quarter of a circle.

2. Estimate the value of $\pi$ using the density function $\sqrt{1 - x^2}$ and computing the area. Which method yields smaller variance?

[Figure: simulated and fitted distributions.]
INTUITION BEHIND THE
POISSON DISTRIBUTION 


Poisson Modeling 


The Poisson distribution is a prob- 
ability distribution that is often used to 
model the number of times an event oc- 
curs in a fixed interval of time or space, 
given the average rate at which the event 
occurs. It is named after the French mathematician Siméon Denis Poisson, who
introduced it in the early 19th century.

The intuition behind the Poisson 
distribution can be understood with an 
example. Consider a factory that produc- 
es light bulbs. The factory produces light 
bulbs at a rate of 10 bulbs per minute. We 
want to know the probability of a certain 
number of bulbs being produced in a giv- 
en time interval, say 1 minute. 

The Poisson distribution follows:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!},$$

where $X$ is the random variable representing the number of events, $k$ is the number of events, and $\lambda$ is the average rate at which the events occur.


Mathematical Derivation 


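The Poisson distribution can be derived as a limit of the binomial distribution, which models the number of successes in a fixed number $n$ of binary trials and therefore cannot directly describe counts over a continuous interval.

The binomial PDF follows:

$$\mathrm{binom}(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}.$$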


The main difference directly observed 
from the equation is that the binomial 
PDF depends on the total number of ob- 
servations n, while the Poisson PDF does 
not. This is because we assume the total 
number of observations tends to infinity 
in Poisson. 

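From the definition of $p$ and $\lambda$, we have $p = \lambda/n$, as the probability of an event occurring is the quotient of the number of successful events and the total number of events. Then, as $n$ approaches infinity, we must also assume $p$ approaches 0 on an infinitely small interval of time. Thus, the only parameter in the Poisson distribution is $\lambda$.

Therefore, we get the following derivation:

$P(X = k)$
$= \lim_{n \to \infty} \binom{n}{k} p^k (1 - p)^{n-k}$
$= \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}$
$= \lim_{n \to \infty} \frac{n!}{k!\,(n-k)!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}$
$= \lim_{n \to \infty} \frac{n!}{(n-k)!\, n^k} \left(\frac{\lambda^k}{k!}\right) \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}$
$= \lim_{n \to \infty} \frac{n(n-1)\cdots(n-k+1)}{n^k} \left(\frac{\lambda^k}{k!}\right) (e^{-\lambda}) (1)$
$= 1 \cdot \left(\frac{\lambda^k}{k!}\right) (e^{-\lambda})$
$= \frac{\lambda^k e^{-\lambda}}{k!},$

obtaining the Poisson distribution, which is used in a variety of fields to model the occurrence of rare events or the distribution of counts.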
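The limit can be watched numerically. The sketch below uses the light bulb rate $\lambda = 10$ from the example above and an arbitrarily chosen count $k = 8$, and shows the gap between the binomial probability with $p = \lambda/n$ and the Poisson probability shrinking as $n$ grows. The helper names are illustrative.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events at average rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam, k = 10.0, 8
sizes = (20, 200, 2000, 20000)
gaps = [abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for n in sizes]

for n, gap in zip(sizes, gaps):
    print(n, gap)
```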
RANK SUM & NON-PARAMETRIC HYPOTHESIS TESTS


Tests For Ranks 


Rank sum tests are non-parametric 
tests used to compare two independent 
samples. The most commonly used rank 
sum test is the Mann-Whitney U test. 

The rank sum test is often used as 
an alternative to the two-sample t-test 
when the data are not normally distribut- 
ed or when the assumptions of the para- 
metric tests are violated, such as small 
sample sizes and unequal variances. 


Mann-Whitney U Test 


The Mann-Whitney U test evalu- 
ates whether two samples have the same 
distribution by comparing the ranks of 
the data in each sample. The null hypoth- 
esis of the test is that there is no differ- 
ence between the distributions of the two 
groups. In other words, we assume that
the observations in both groups are i.i.d.

The test works by assigning ranks 
to all the observations in the two groups 
and calculating the sum of the ranks for 
each group. To derive the test statistic, we first calculate the expected value and variance of the rank sum. Let $n_A$ and $n_B$ denote the sample sizes of the two groups A and B, and W be the sum of the ranks of group A. Then, under the null hypothesis that the ranks of A are identically distributed,

$$E(W) = E\left(\sum_{i=1}^{n_A+n_B} i\,C_i\right),$$

where

$$C_i = \begin{cases} 1, & \text{if the } i\text{-th smallest value in } A \cup B \text{ is in } A; \\ 0, & \text{otherwise.} \end{cases}$$

Under the null hypothesis, $E(C_i) = n_A/(n_A+n_B)$. So,

$$E(W) = \sum_{i=1}^{n_A+n_B} i\,E(C_i) = \frac{n_A}{n_A+n_B} \cdot \frac{(n_A+n_B)(n_A+n_B+1)}{2} = \frac{n_A(n_A+n_B+1)}{2}.$$

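The variance of the rank sum is, writing $N = n_A + n_B$,

$$
\begin{aligned}
\mathrm{Var}(W) &= E(W^2) - E(W)^2 \\
&= \frac{n_A (N+1)(2N+1)}{6} + \frac{n_A (n_A-1)(N+1)(3N+2)}{12} - \frac{n_A^2 (N+1)^2}{4} \\
&= \frac{n_A (N+1)\left[2(2N+1) + (n_A-1)(3N+2) - 3 n_A (N+1)\right]}{12} \\
&= \frac{n_A n_B (n_A+n_B+1)}{12}.
\end{aligned}
$$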


Thus, we can deduce the approximately normally distributed test statistic U as:

$$U = \frac{\sum_{i=1}^{n_A+n_B} i\,C_i - \frac{n_A(n_A+n_B+1)}{2}}{\sqrt{\frac{n_A n_B (n_A+n_B+1)}{12}}} \sim N(0, 1).$$

The test statistic can similarly be deduced for sample B.
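
As a rough illustration of the statistic, the sketch below ranks the pooled observations, sums the ranks of group A, and standardizes with the mean and variance derived in this section. The function name `rank_sum_z` is hypothetical, and the code assumes there are no tied values (ties would need averaged ranks and a variance correction).

```python
import math

def rank_sum_z(a, b):
    # Pool both samples, remembering the group of each value (0 = A, 1 = B).
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    # W is the sum of the 1-based ranks that belong to group A (assumes no ties).
    w = sum(rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0)
    n_a, n_b = len(a), len(b)
    mean_w = n_a * (n_a + n_b + 1) / 2
    var_w = n_a * n_b * (n_a + n_b + 1) / 12
    return w, (w - mean_w) / math.sqrt(var_w)

w, z = rank_sum_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(w, z)
```

For samples this small the normal approximation is crude; the example is only meant to show the arithmetic.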


Non-Parametric Tests 


Non-parametric tests have the following advantages:

1) The loss in power relative to their parametric counterparts is usually small; 2) They do not require normality assumptions; 3) They are more robust when dealing with outliers.

However, they also have several limitations. Usually, non-parametric tests rely on measures such as rank rather than the raw data, so some information is lost.


CHEBYSHEV’S
INEQUALITY 


Bounded Probability 


Chebyshev’s inequality is a theo- 
rem in probability theory that provides a 
bound on the probability that a random 
variable deviates from its expected value 
by more than a certain amount. Specifically, for any random variable X with finite mean $\mu$ and finite variance $\sigma^2$, Chebyshev’s inequality states that the probability that X deviates from $\mu$ by at least k standard deviations is at most $1/k^2$, for any positive constant k.

Chebyshev’s inequality is a useful 
tool in probability theory and statistics 
because it provides a general bound on 
the probability of extreme deviations of 
a random variable from its mean, with- 
out making any assumptions about the 
distribution of the variable. However, the 
bound provided by Chebyshev’s inequali- 
ty is generally not very tight, and stronger 
bounds can often be obtained using more 
specific properties of the distribution. 


Mathematical Derivation 


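Mathematically, Chebyshev’s inequality can be expressed as:

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

To show the inequality is valid, we can use the following steps:

$$P(|X - \mu| \ge k\sigma) = \int_{\{x \,:\, |x-\mu| \ge k\sigma\}} f(x)\,dx.$$

Since $|x - \mu| \ge k\sigma \Rightarrow \frac{|x-\mu|}{k\sigma} \ge 1 \Rightarrow \frac{(x-\mu)^2}{k^2\sigma^2} \ge 1$, we further derive that:

$$
\begin{aligned}
P(|X - \mu| \ge k\sigma) &\le \int_{\{x \,:\, |x-\mu| \ge k\sigma\}} \frac{(x-\mu)^2}{k^2\sigma^2} f(x)\,dx \\
&\le \frac{1}{k^2\sigma^2} \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx \\
&= \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2},
\end{aligned}
$$

completing the proof. However, the bound is relatively loose for most distributions.

Interested in learning more? Contact the author at sophia.yx.zhu@gmail.com.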
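
A small simulation makes the looseness of the bound concrete. This is a sketch under stated assumptions: the exponential distribution, the sample size, and the helper name `tail_fraction` are illustrative choices, not taken from the article.

```python
import math
import random

def tail_fraction(samples, k):
    # Fraction of samples at least k standard deviations away from the mean,
    # using the mean and (population) standard deviation of the samples themselves.
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / n)
    return sum(1 for x in samples if abs(x - mu) >= k * sigma) / n

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100_000)]  # a skewed, non-normal distribution
for k in (1.5, 2, 3):
    print(k, tail_fraction(data, k), 1 / k**2)  # observed tail fraction vs. Chebyshev bound
```

The observed fractions sit well below $1/k^2$, which is the sense in which the bound is not tight.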


[Figure: the empirical rule for normal distributions (bands at μ ± 1σ, μ ± 2σ, and μ ± 3σ) compared with Chebyshev’s inequality for any distribution: at least 75% of the probability lies within μ ± 2σ and at least 88.89% within μ ± 3σ.]


THE DELTA METHOD FOR
CALCULATING VARIANCE


Estimation of Distribution 


The delta method is a statistical 
technique used to estimate the distribu- 
tion of a function of a random variable. 
The method is particularly useful when 
the random variable is normally distrib- 
uted or can be approximated by a normal 
distribution, and the function of the ran- 
dom variable is nonlinear. 

The delta method provides an ap- 
proximation to the distribution of a func- 
tion of a random variable by using the 
first and second moments of the random 
variable. Specifically, let X be a random variable with mean $\mu$ and variance $\sigma^2$, and let g(X) be a function of X. The delta method states that if X is approximately normal and g(X) is differentiable at $\mu$ with $g'(\mu) \neq 0$, then the distribution of g(X) can be approximated by a normal distribution with mean $g(\mu)$ and variance $g'(\mu)^2 \sigma^2$, where $g'(\mu)$ is the derivative of g(X) evaluated at $\mu$.


Mathematical Derivation 


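Mathematically, the delta method states that if $\frac{X - \mu}{\sigma} \sim N(0, 1)$ approximately, then

$$\frac{g(X) - g(\mu)}{g'(\mu)\,\sigma} \sim N(0, 1)$$

approximately. The proof is as follows: when X is very close to $\mu$, we get $\frac{g(X) - g(\mu)}{X - \mu} \approx g'(\mu)$ by definition of derivatives. This implies:

$$\frac{g(X) - g(\mu)}{g'(\mu)} \approx X - \mu, \qquad \frac{g(X) - g(\mu)}{g'(\mu)\,\sigma} \approx \frac{X - \mu}{\sigma} \sim N(0, 1).$$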


Thus, we have completed the proof. 
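
The approximation can be compared against a simulation. In this sketch the choice $g = \log$, the values $\mu = 5$ and $\sigma = 0.1$, and the function names are all assumptions made for illustration; the delta-method variance is $g'(\mu)^2\sigma^2 = \sigma^2/\mu^2$.

```python
import math
import random

def delta_variance_log(mu, sigma):
    # Delta-method variance of log(X): g'(mu)^2 * sigma^2 with g'(mu) = 1 / mu.
    return (1.0 / mu) ** 2 * sigma ** 2

def simulated_variance_log(mu, sigma, n=100_000, seed=0):
    # Draw X ~ N(mu, sigma^2), transform with log, and return the sample variance.
    rng = random.Random(seed)
    values = [math.log(rng.gauss(mu, sigma)) for _ in range(n)]
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

print(delta_variance_log(5.0, 0.1), simulated_variance_log(5.0, 0.1))
```

The two numbers agree closely here because $\sigma$ is small relative to $\mu$; the first-order approximation degrades as the spread of X grows.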


Use Cases


The delta method allows us to ap- 
proximate the distribution of a function 
of a random variable using its first-order 
Taylor series expansion. It is common- 
ly used in statistics to approximate the 
variance of sample statistics, such as the 
sample mean or sample proportion, or to 
estimate the standard error of a regression 
coefficient. It can also be used to con- 
struct confidence intervals and hypothesis 
tests for parameters in statistical models. 

The method is used in many ar- 
eas of statistics, including econometrics, 
finance, and engineering. It is particularly 
useful in estimating the distribution of a 
nonlinear function of a random variable, 
such as the standard deviation of a sam- 
ple mean, the probability of default in a 
loan portfolio, or the expected return on 
a stock portfolio. The method provides a 
simple and computationally efficient way 
to estimate the distribution of a function 
of a random variable without resorting to 
more complex methods such as Monte
Carlo simulation.


INTUITION BEHIND
FISHER’S EXACT TEST


Categorical Variables 

Fisher’s exact test is used to deter- 
mine the significance of the association 
between two categorical variables. It is 
particularly useful when sample sizes are 
small, and the assumptions of the chi- 
squared test cannot be met. 

Fisher’s exact test calculates the 
probability of observing the observed 
frequency distribution or a more extreme 
one. To do this, it uses the hypergeo- 
metric distribution, which calculates the 
probability of observing a certain number 
of successes (e.g., the frequency of one 
category of X and one category of Y) in 
a fixed number of draws (e.g., the total 
number of observations) from a popu- 
lation with a fixed number of successes 
(e.g., the total number of observations in 
the other category of X, $n_1$, and the other
category of Y, $n_2$).


Hypergeometric Distribution

The probability mass function of the hypergeometric distribution is:

$$P(X = x \mid X + Y = z) = \frac{\binom{n_1}{x}\binom{n_2}{z-x}}{\binom{n_1+n_2}{z}},$$

where $n_1$ and $n_2$ are the numbers of objects in X and Y. Specifically, the denominator represents the total number of ways to choose z objects out of all $(n_1 + n_2)$ objects, and the numerator represents the number of ways of choosing x objects out of X and choosing the remaining z - x objects out of Y.


P-Value of Fisher’s Test 


Fisher’s exact test uses contingency tables to calculate its P-value. For a one-sided Fisher’s test, the null hypothesis is $p_1 = p_2$, with the alternative hypothesis $p_1 > p_2$ (or the other way around).

Express the data as a contingency table as below, where the first row is X, the second row is Y, the first column is successes, and the second column is failures:

          Successes   Failures   Total
X         a           b          a+b
Y         c           d          c+d
Total     a+c         b+d


Then, we write out several other contingency tables with an increasing number of successes in X and a decreasing number of successes in Y, fixing the margins of the contingency table.

Lastly, we calculate the probability of each contingency table occurring using the hypergeometric distribution and sum them up. The sum is the P-value of the test. Fisher’s exact test gives us an exact p-value and maintains the nominal level of significance without requiring a large sample size.
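
The procedure can be sketched in a few lines. The function names below are hypothetical, and the code covers only the one-sided alternative $p_1 > p_2$: it keeps the margins fixed, enumerates the tables with at least as many successes in X as observed, and sums their hypergeometric probabilities.

```python
from math import comb

def hypergeom_pmf(x, n1, n2, z):
    # Probability of x successes in group X given z successes in total,
    # with group sizes n1 (X) and n2 (Y). comb returns 0 for impossible cells.
    return comb(n1, x) * comb(n2, z - x) / comb(n1 + n2, z)

def fisher_one_sided(a, b, c, d):
    # Table layout: row X = (a, b), row Y = (c, d); columns = (successes, failures).
    n1, n2, z = a + b, c + d, a + c
    most_extreme = min(n1, z)  # largest possible number of successes in X
    return sum(hypergeom_pmf(x, n1, n2, z) for x in range(a, most_extreme + 1))

print(fisher_one_sided(3, 1, 1, 3))  # 17/70 for this table
```

For the table (3, 1; 1, 3) only two tables are at least as extreme, with probabilities 16/70 and 1/70.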


WEIGHTED CONFOUNDING
& THE CMH TEST


Weighting 


Simpson’s paradox refers to the phenomenon in which the direction of the association between two variables is reversed when the data is aggregated or grouped by a third variable. Thus, we usually stratify by the confounding variable and combine the stratified estimates (as we previously did with ANCOVA in regression).

Suppose we have two normally distributed variables $X \sim N(\mu, \sigma_1^2)$ and $Y \sim N(\mu, \sigma_2^2)$, stratified from the raw data centered at $\mu$. The MLE of $\mu$ can be found by maximizing the log-likelihood of the joint distribution of X and Y, which up to an additive constant is

$$\ell(\mu) = -\frac{(x-\mu)^2}{2\sigma_1^2} - \frac{(y-\mu)^2}{2\sigma_2^2},$$

where x and y are instances of X and Y. Setting the derivative with respect to $\mu$ to zero gives:

$$\frac{x-\mu}{\sigma_1^2} + \frac{y-\mu}{\sigma_2^2} = 0;$$

solving this equation with respect to $\mu$ implies

$$\mathrm{MLE}(\mu) = \frac{w_1 x + w_2 y}{w_1 + w_2},$$

where $w_1 = 1/\sigma_1^2$ and $w_2 = 1/\sigma_2^2$. The MLE of $\mu$ can further be written as

$$\mathrm{MLE}(\mu) = u x + (1-u) y,$$

where $u = w_1/(w_1+w_2)$. The result is a weighted sum of the observed x and y, where the greater the variance of x, the smaller the weight of x.
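
A minimal sketch of this inverse-variance weighting follows; the function name and the numbers in the example are illustrative assumptions.

```python
def inverse_variance_mean(x, var_x, y, var_y):
    # Weights are the reciprocals of the variances: w1 = 1 / var_x, w2 = 1 / var_y.
    w1, w2 = 1.0 / var_x, 1.0 / var_y
    return (w1 * x + w2 * y) / (w1 + w2)

# x is measured four times more precisely than y, so the estimate lands closer to x.
print(inverse_variance_mean(10.0, 1.0, 20.0, 4.0))  # 12.0
```

With equal variances the estimate reduces to the plain average of x and y.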


CMH Test Statistic 


A problem associated with directly weighting the sample data based on a confounding variable is that unnecessary stratification can lead to a decrease in precision. The CMH (Cochran-Mantel-Haenszel) test is used to determine if there is a significant association between two variables while controlling for a confounding variable. Suppose that after stratification, the raw data (one contingency table) is separated into n tables. Define $\theta_i$ as the sample odds ratio of the i-th contingency table. In the i-th table below,

    a   b
    c   d

each odds ratio $\theta_i$ is the odds of success for category X (a/b) over that of Y (c/d), which is (ad)/(bc).

The null hypothesis for the CMH test is $\theta_1 = \dots = \theta_n = 1$, with an alternative hypothesis: at least one $\theta_i$ is not equal to 1. In other words, the CMH test tests whether the stratified tables have odds ratios different from 1, that is, whether an association remains after stratification.

The CMH test estimate for the total population odds ratio is the weighted average of the $\theta$’s, $\sum_k w_k \theta_k / \sum_k w_k$, where the weight of the k-th contingency table is $w_k = (b_k c_k)/(a_k+b_k+c_k+d_k)$, representing the inverse of variances from hypergeometric distributions. The estimate simplifies to:

$$\hat{\theta}_{MH} = \frac{\sum_k a_k d_k / (a_k+b_k+c_k+d_k)}{\sum_k b_k c_k / (a_k+b_k+c_k+d_k)}.$$

The statistic uses the intuition behind the weighting of stratified data.

The CMH test statistic is derived by conditioning on the marginal distributions just like Fisher’s exact test, thus resulting in n hypergeometric distributions, but only leaving the first cell of each contingency table free. This is because if the frequency of each column and row is fixed, or conditioned on, then the odds ratio of the k-th contingency table can be expressed as a function of $a_k$ only. Specifically, the statistic is

$$\frac{\left[\sum_k \left(a_k - E(a_k)\right)\right]^2}{\sum_k \mathrm{Var}(a_k)},$$

where:

$$E(a_k) = \frac{(a_k+b_k)(a_k+c_k)}{a_k+b_k+c_k+d_k};$$

$$\mathrm{Var}(a_k) = \frac{(a_k+b_k)(c_k+d_k)(a_k+c_k)(b_k+d_k)}{(a_k+b_k+c_k+d_k)^2\,(a_k+b_k+c_k+d_k-1)}.$$
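
The pieces above can be put together in a short sketch. The function name `cmh` and the example counts are made up for illustration, and the statistic is computed without a continuity correction, which some textbook versions include.

```python
def cmh(tables):
    # tables: one (a, b, c, d) tuple of 2x2 counts per stratum.
    diff_sum = var_sum = numerator = denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        expected = (a + b) * (a + c) / n
        variance = (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        diff_sum += a - expected
        var_sum += variance
        numerator += a * d / n      # weighted odds-ratio numerator
        denominator += b * c / n    # weights w_k = b_k c_k / n_k
    statistic = diff_sum ** 2 / var_sum   # compare with chi-squared, 1 degree of freedom
    pooled_odds_ratio = numerator / denominator
    return statistic, pooled_odds_ratio

print(cmh([(10, 5, 4, 11), (8, 7, 3, 12)]))
```

A statistic above roughly 3.84 would be significant at the 5% level under the chi-squared reference distribution with one degree of freedom.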

Under the null hypothesis that the odds ratios are all 1, and when the sample size is large enough, the statistic follows a chi-squared distribution with 1 degree of freedom. The chi-squared test can be used to determine whether there is a difference between the null distribution of frequencies and the observed distribution of frequencies, and because 2-by-2 contingency tables are used, the degrees of freedom is (2-1)(2-1) = 1. By using the CMH test rather than a series of chi-squared tests, we can lower the Type I error rate by avoiding the multiple comparison problem, which arises when we perform multiple statistical tests on the same data set. When we perform multiple tests, the probability of observing at least one significant result by chance increases, thus increasing the occurrence of Type I errors.

[Portrait: Nathan Mantel]


Thomas Bayes
Thomas Bayes (1702-1761) was an English mathematician, statistician, and Presbyterian minister, who is best known for developing Bayes’ theorem. Bayes’ theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis based on new evidence. Today, Bayes’ theorem is used in a wide range of fields, including statistics, computer science, economics, and engineering, among others.


Proof of the Bayes Theorem 


Bayes’ theorem states:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} = \frac{P(X \mid Y)\,P(Y)}{\sum_{k=1}^{n} P(Y_k)\,P(X \mid Y_k)}.$$

First, we can derive the formula

$$P(Y \mid X) = \frac{P(X \cap Y)}{P(X)}$$

directly from the definition of conditional probability, and $P(X \cap Y)$ can further be written as:

$$P(X \cap Y) = P(Y)\,P(X \mid Y)$$

using the multiplication rule of probability.

We can then expand the denominator using the total probability theorem. That is,

$$P(X) = \sum_{k=1}^{n} P(Y_k)\,P(X \mid Y_k)$$

if Y is discrete or

$$P(X) = \int_{-\infty}^{\infty} P(Y)\,P(X \mid Y)\,dY$$

if Y is continuous. The probability P(X) is also known as the marginal likelihood of X.


Diagnostic Likelihood Ratio


The diagnostic likelihood ratio (DLR) is a statistical measure that quantifies how much the results of a diagnostic test can change the probability of a patient having a particular condition. To show how the post-test odds are calculated, we first define the following metrics:

1) The sensitivity is the probability of the test being positive given the subject has the disease, or $P(+ \mid D)$.

2) The specificity is the probability of the test being negative given the subject doesn’t have the disease, or $P(- \mid D^c)$.

3) The DLR of a positive test (DLR⁺) is the ratio of the probability of a positive result in a subject with the disease to that in a subject without it, which is $P(+ \mid D) / P(+ \mid D^c)$, or sensitivity / (1 - specificity).

4) The DLR of a negative test (DLR⁻) is the ratio of the probability of a negative result in a subject with the disease to that in a subject without it, which is $P(- \mid D) / P(- \mid D^c)$, or (1 - sensitivity) / specificity.

5) The positive pre-test odds refer to the odds of disease before the test, or $P(D) / P(D^c)$. Similarly, the negative pre-test odds are $P(D^c) / P(D)$.

6) The positive post-test odds refer to the odds of disease after a positive test, or $P(D \mid +) / P(D^c \mid +)$. Similarly, the negative post-test odds are $P(D \mid -) / P(D^c \mid -)$.

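From the discrete version of the Bayes rule, we have:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid D^c)\,P(D^c)}$$

and

$$P(D^c \mid +) = \frac{P(+ \mid D^c)\,P(D^c)}{P(+ \mid D)\,P(D) + P(+ \mid D^c)\,P(D^c)}.$$

Dividing the two, we get

$$\frac{P(D \mid +)}{P(D^c \mid +)} = \frac{P(+ \mid D)}{P(+ \mid D^c)} \cdot \frac{P(D)}{P(D^c)}.$$

So, the positive post-test odds are the DLR⁺ times the positive pre-test odds. The result similarly applies to the negative post-test odds.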
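
The odds update can be turned into a small calculator. This is a sketch: the function name and the example sensitivity, specificity, and prevalence are assumptions chosen for illustration.

```python
def post_test_probability(sensitivity, specificity, prevalence, positive=True):
    # Likelihood ratio for the observed test result.
    if positive:
        dlr = sensitivity / (1 - specificity)      # DLR+
    else:
        dlr = (1 - sensitivity) / specificity      # DLR-
    pre_odds = prevalence / (1 - prevalence)       # odds of disease before the test
    post_odds = dlr * pre_odds                     # post-test odds = DLR x pre-test odds
    return post_odds / (1 + post_odds)             # convert odds back to a probability

# Illustrative test: 90% sensitivity, 80% specificity, 10% prevalence.
print(post_test_probability(0.9, 0.8, 0.1, positive=True))   # about 0.333
print(post_test_probability(0.9, 0.8, 0.1, positive=False))  # about 0.0137
```

Both values match what a direct application of Bayes’ rule gives, which is the point of the odds form: it separates what the test contributes (the DLR) from what was believed beforehand (the pre-test odds).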


CREDIBLE INTERVAL AND
THE BETA DISTRIBUTION 


Credible Intervals 
vs. Confidence Intervals 


Bayesian credible intervals are a 
way of quantifying the uncertainty in an 
estimated parameter or set of parameters 
in a Bayesian statistical model. Unlike 
frequentist confidence intervals, which 
are based on repeated sampling and prob- 
ability limits, Bayesian credible intervals 
are based on the posterior distribution of 
the parameter(s) of interest, which takes 
into account both the data and any prior 
information. 


Constructing a 


Credible Interval 


To construct a Bayesian credible in- 
terval, we first need to define a prior dis- 
tribution that represents our beliefs about 
the parameter of interest before seeing 
the data. Then, we need to compute the 
posterior distribution of the parameter 
given the observed data. Finally, we can 
use the posterior distribution to compute 
the credible interval, which is a range of 
values that contains a specified percent- 
age of the posterior distribution. 

For example, if we want to construct a 95% credible interval for the mean of a normal distribution, we could start with a prior distribution for the mean (e.g., a normal distribution with a specified mean and variance). Then, we could use Bayes’ theorem to update the prior to the posterior distribution, which would be another normal distribution with updated mean and variance based on the observed data. Finally, we could compute the 2.5th and 97.5th percentiles of the posterior distribution to obtain the 95% credible interval. This interval would contain the true mean with 95% probability, given the observed data and the prior distribution.
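
The three steps can be sketched for the normal example. The code assumes the data variance is known (the standard setting in which the posterior for the mean is exactly normal); the function names and the numbers are illustrative.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_var, data, noise_var):
    # Conjugate update for a normal mean with known data variance noise_var.
    n = len(data)
    sample_mean = sum(data) / n
    post_precision = 1 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + n * sample_mean / noise_var) / post_precision
    return post_mean, 1 / post_precision

def credible_interval(mean, var, level=0.95):
    # Equal-tailed interval: cut (1 - level) / 2 of posterior mass from each side.
    posterior = NormalDist(mean, var ** 0.5)
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)

mean, var = normal_posterior(prior_mean=0.0, prior_var=1.0, data=[1.0, 1.0, 1.0, 1.0], noise_var=1.0)
print(mean, var, credible_interval(mean, var))
```

In this example the posterior mean (0.8) sits between the prior mean (0) and the sample mean (1), pulled toward the data because four observations carry more precision than the prior.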


Strength 


Credible intervals are a strength of 
Bayesian statistics because they provide 
a probabilistic statement about the range 
of plausible values for a parameter, given 
the observed data and the prior distribu- 
tion. The interval can be interpreted as 
the range of values for the parameter that 
has a high probability of being the true 
value, given the data and the prior. 

The credible interval can also be used to make decisions with regard to the null hypothesis rather than the alternative hypothesis. If the null hypothesis falls within the credible interval, then it is plausible that the null hypothesis is true, given the data and the prior. If the null hypothesis does not fall within the credible interval, then it is less plausible that the null hypothesis is true, given the data and the prior.

Jakob Bernoulli
Jakob Bernoulli contributed to the development of calculus, probability theory, the mathematical constant e, and the binomial distribution.

However, it is important to note 
that the strength of the credible interval 
depends on the prior distribution chosen. 
If the prior distribution is not well-in- 
formed or is misspecified, then the result- 
ing credible interval may not accurately 
reflect the range of plausible values for 
the parameter. Therefore, careful consid- 
eration should be given to the choice of 
prior distribution in Bayesian analysis. 


The Beta Distribution 


The beta distribution is a continu- 
ous probability distribution that is defined 
on the interval [0, 1]. It is a two-parame- 
ter family of distributions, and its shape 
is determined by the values of these pa- 
rameters. The distribution is often used as 
a model for the distribution of probabili- 
ties or proportions. 

The probability density function (PDF) of the beta distribution is given by:

$$f(x) = \frac{1}{B(a,b)}\, x^{a-1} (1-x)^{b-1},$$

where x is a random variable between 0 and 1, a and b are positive shape parameters, and B(a, b) is the beta function, defined as:

$$B(a,b) = \int_0^1 t^{a-1} (1-t)^{b-1}\, dt.$$

The reciprocal of the beta function serves as an adjustment to the density function, scaling its cumulative probability to 1.

The expected value of the beta distribution is $E[x] = \frac{a}{a+b}$, and the variance is $\mathrm{Var}[x] = \frac{ab}{(a+b)^2(a+b+1)}$. The beta distribution
is often used as a conjugate prior for the
binomial distribution in Bayesian analy- 
sis, as it allows for closed-form updates 
of the posterior distribution. The beta dis- 
tribution is also used in a wide range of 
applications, including modeling propor- 
tions, Bayesian inference, and Bayesian 
hypothesis testing. 

By using a beta distribution as the conjugate prior of a binomial Bayesian model, we observe that the posterior is also a beta distribution. Specifically, if we have a binomial distribution with parameters n and p, and we use a beta distribution with parameters a and b as the prior for p, then the posterior distribution for p given the data is also a beta distribution with updated parameters a’ and b’, where:

a’ = a + number of successes in the data;
b’ = b + number of failures in the data.

To see the above, we first write out $P(X = k \mid p) = \binom{n}{k} p^k (1-p)^{n-k}$. Using Bayes’ theorem, $P(p \mid k) \propto P(k \mid p)\,P(p)$. Substituting the beta distribution as the prior, we have $P(p \mid k) \propto \left(p^k (1-p)^{n-k}\right)\left(p^{a-1}(1-p)^{b-1}\right)$. This implies $P(p \mid k) \propto p^{k+a-1}(1-p)^{n-k+b-1}$, which is equivalent to $p^{a'-1}(1-p)^{b'-1}$. The binomial coefficient $\binom{n}{k}$ does not depend on p, so it is absorbed into the scaling factor, which is the reciprocal of the beta function $B(a', b')$. Thus, we have shown that the beta prior yields a posterior that also follows a beta distribution; this pairing is called the beta-binomial model.
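
The conjugate update amounts to simple bookkeeping on the two shape parameters. A minimal sketch, with hypothetical function names and an illustrative Beta(2, 2) prior:

```python
def beta_update(a, b, successes, failures):
    # Posterior shape parameters: a' = a + successes, b' = b + failures.
    return a + successes, b + failures

def beta_mean_var(a, b):
    # Mean and variance of a Beta(a, b) distribution.
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

a_post, b_post = beta_update(2, 2, successes=7, failures=3)
print(a_post, b_post, beta_mean_var(a_post, b_post))  # Beta(9, 5), mean 9/14
```

The posterior mean 9/14 lies between the prior mean 1/2 and the observed proportion 7/10, and the prior behaves like four extra pseudo-observations.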


Nuisance Parameters


In statistics, a nuisance parameter 
is a parameter that is not of direct interest 
in a study but must be accounted for in 
order to make valid inferences about the 
parameters of interest. 

In Bayesian statistics, nuisance 
parameters are typically treated as ran- 
dom variables and marginalized out of 
the posterior distribution. This means that 
credible intervals take into account the 
uncertainty in the nuisance parameters 
and reflect the uncertainty in the param- 
eter of interest, given the data and any 
prior information, making Bayesian statistics particularly useful in modeling the variation and noise in machine learning.

The process of eliminating nui- 
sance parameters is called marginaliza- 
tion, and it involves integrating the joint 
posterior distribution over the nuisance 
parameters to obtain the marginal pos- 
terior distribution of the parameters of 
interest. This can be done using Bayes’ theorem and the law of total probability, which states that the probability of an event equals the sum (or integral) of its joint probabilities with every possible value of another variable.

For example, suppose we have a Bayesian model with two parameters of interest, $\theta$ and $\phi$, and a nuisance parameter, $\Gamma$. Let y denote the observed data. The joint posterior distribution of the three parameters can be written using the Bayes rule:

$$P(\theta, \phi, \Gamma \mid y) \propto P(y \mid \theta, \phi, \Gamma)\,\pi(\theta)\,\pi(\phi)\,\pi(\Gamma).$$

To eliminate the nuisance parameter $\Gamma$, we can integrate over all possible values of $\Gamma$, which gives us the marginal posterior distribution of $\theta$ and $\phi$:

$$P(\theta, \phi \mid y) = \int P(\theta, \phi, \Gamma \mid y)\,d\Gamma \propto \int P(y \mid \theta, \phi, \Gamma)\,\pi(\theta)\,\pi(\phi)\,\pi(\Gamma)\,d\Gamma.$$

This marginal posterior distribution of $\theta$ and $\phi$ represents our updated beliefs about these parameters based on the data, after marginalizing out the nuisance parameter $\Gamma$.

In contrast, frequentist statistics 
typically treat nuisance parameters as 
fixed, unknown values and use various 
techniques such as maximum likelihood 
estimation or method of moments to es- 
timate their values. Confidence intervals 
are then constructed to reflect the uncer- 
tainty in the parameter of interest, given 
the data and the estimated values of the 
nuisance parameters. 

In practice, marginalization can 
be computationally expensive. Howev- 
er, eliminating nuisance parameters can 
lead to more accurate inference and more 
informative posterior distributions for the 
parameters of interest. In general, Bayesian methods can be more flexible in dealing with nuisance parameters.


THE FIBONACCI SEQUENCE
& LINEAR MAPS


Formulating the Problem 


Let $f_1, f_2, \dots$ denote the Fibonacci sequence F, defined by

$$f_1 = 1, \quad f_2 = 1, \quad f_n = f_{n-1} + f_{n-2}.$$

The problem is to find the value of the n-th term in the Fibonacci sequence.


Defining a Linear Map 


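We first define an operator T on $\mathbb{R}^2$ as T(x, y) = (y, x+y). We easily get that $T^n(0, 1) = T \circ \dots \circ T(0, 1) = T^{n-1}(1, 1) = \dots = (f_n, f_{n+1})$ by definition of T.

The eigenvalues of T are found through the set of equations

$$y = \lambda x, \qquad x + y = \lambda y$$

as $\frac{1+\sqrt{5}}{2}$ and $\frac{1-\sqrt{5}}{2}$. The eigenvectors are naturally of the form $(x, \frac{1+\sqrt{5}}{2}x)$ and $(x, \frac{1-\sqrt{5}}{2}x)$ by $y = \lambda x$.

We can write a new basis for $\mathbb{R}^2$: $b_1 = (1, \frac{1+\sqrt{5}}{2})$ and $b_2 = (1, \frac{1-\sqrt{5}}{2})$. After the change of basis, $(0, 1) = \frac{1}{\sqrt{5}}(b_1 - b_2)$. Thus, $T^n(0, 1) = \frac{1}{\sqrt{5}}T^n(b_1 - b_2) = \frac{1}{\sqrt{5}}(T^n b_1 - T^n b_2) = (f_n, f_{n+1})$.

Since $b_1$ is an eigenvector, $T^n(b_1) = T^{n-1}\left(\frac{1+\sqrt{5}}{2} b_1\right) = \dots = \left(\frac{1+\sqrt{5}}{2}\right)^n b_1$. Similarly, $T^n(b_2) = \left(\frac{1-\sqrt{5}}{2}\right)^n b_2$. Hence, $\frac{1}{\sqrt{5}}(T^n b_1 - T^n b_2) = \frac{1}{\sqrt{5}}\left(\left(\frac{1+\sqrt{5}}{2}\right)^n b_1 - \left(\frac{1-\sqrt{5}}{2}\right)^n b_2\right) = (f_n, f_{n+1})$, and comparing first coordinates implies

$$f_n = \frac{1}{\sqrt{5}}\left(\left(\frac{1+\sqrt{5}}{2}\right)^n - \left(\frac{1-\sqrt{5}}{2}\right)^n\right),$$

the expression for the n-th term of the Fibonacci sequence.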
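
The closed form can be checked against repeated application of T. In this sketch the function names are hypothetical, and the closed form is rounded because floating-point arithmetic is only exact enough for moderate n.

```python
import math

def fib_by_map(n):
    # Apply T(x, y) = (y, x + y) to (0, 1) n times; the first coordinate is f_n.
    x, y = 0, 1
    for _ in range(n):
        x, y = y, x + y
    return x

def fib_closed_form(n):
    # f_n = (phi^n - psi^n) / sqrt(5), with phi and psi the two eigenvalues of T.
    sqrt5 = math.sqrt(5)
    phi, psi = (1 + sqrt5) / 2, (1 - sqrt5) / 2
    return round((phi ** n - psi ** n) / sqrt5)

print([fib_by_map(n) for n in range(1, 11)])
print([fib_closed_form(n) for n in range(1, 11)])
```

Both lines print the same first ten Fibonacci numbers, ending in 55.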


[Figure: geometric representation of the Fibonacci sequence, showing the vectors v1 and v2.]

Two Sets of Basis:
We switched from the standard (0, 1) and (1, 0) set to $b_1$ and $b_2$, changing the relative axes of vectors in $\mathbb{R}^2$.


POLAR DECOMPOSITION
OF LINEAR MAPS


[Figure: an undeformed configuration mapped to a deformed configuration.]


Complex Numbers


Recall that any v in C can be expressed as $|v|e^{i\theta}$, where $|v| = \sqrt{\bar{v}v}$ is a scaling coefficient representing the length of v and $e^{i\theta}$ is a complex number of unit length representing the direction of v on the complex plane. It can be seen as decomposing v into a scaling by |v| and a rotation by $\theta$.


Polar Decomposition


The polar decomposition theorem states that any linear map A in $\mathbb{C}^{n \times n}$ can be decomposed into a product of two matrices, one of which is a unitary (isometric) matrix and the other is a positive-semidefinite Hermitian matrix. The theorem states that A = UP, where U in $\mathbb{C}^{n \times n}$ is a unitary matrix, which means that its inverse is equal to its conjugate transpose, and P in $\mathbb{C}^{n \times n}$ is a positive-semidefinite Hermitian matrix, which means that all of its eigenvalues are non-negative.

Similar to the polar form of complex numbers, we can first derive the value of P, representing the scaling factor of A:

$$A^*A = (UP)^*(UP) = P^*(U^*U)P.$$

Since U is the rotational factor of A, or a rotation matrix, it can be shown that the columns of U with regard to the standard basis of an n-dimensional vector space are orthonormal. Then, $U^*U = I_n = UU^*$. This implies $A^*A = P^*(U^*U)P = P^*P = P^2$, since P is Hermitian, so $P = (A^*A)^{1/2}$.

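On the other hand, when A is invertible, the rotational factor U can be found by substituting P into A = UP, implying $U = A(A^*A)^{-1/2}$. More generally, let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of $(A^*A)^{1/2}$ and $x_1, \dots, x_n$ be corresponding orthonormal eigenvectors. There exists an integer $r \le n$ such that $\lambda_{r+1} = \dots = \lambda_n = 0$, as A is not necessarily full rank and invertible. We establish that $P x_i = \lambda_i x_i$. Since $A x_i = U P x_i = \lambda_i U x_i$, this implies $U x_i = (1/\lambda_i) A x_i$ when $i \le r$, and we may take $U x_i = x_i$ when $i > r$ (a valid choice whenever these $x_i$ are orthogonal to the range of A). U can be written as:

$$U = \left[\tfrac{1}{\lambda_1} A x_1, \; \dots, \; \tfrac{1}{\lambda_r} A x_r, \; x_{r+1}, \; \dots, \; x_n\right] \left[x_1, \; \dots, \; x_n\right]^*,$$

which is also consistent with the previous expression $U = A(A^*A)^{-1/2}$ when r = n.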
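
For a concrete feel, the decomposition can be computed by hand for a real, invertible 2-by-2 matrix. The sketch below is restricted to that special case and uses a closed form for the square root of a 2-by-2 positive-definite matrix M, namely $(M + sI)/t$ with $s = \sqrt{\det M}$ and $t = \sqrt{\operatorname{tr} M + 2s}$ (a consequence of the Cayley-Hamilton theorem); the function names are hypothetical.

```python
import math

def matmul(X, Y):
    # Product of two 2x2 matrices given as nested lists.
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def polar_2x2(A):
    # Assumes A is real, 2x2, and invertible.
    (a, b), (c, d) = A
    # M = A^T A is symmetric positive definite.
    m11, m12, m22 = a * a + c * c, a * b + c * d, b * b + d * d
    s = math.sqrt(m11 * m22 - m12 * m12)        # sqrt(det M) = |det A|
    t = math.sqrt(m11 + m22 + 2 * s)            # trace of sqrt(M)
    P = [[(m11 + s) / t, m12 / t], [m12 / t, (m22 + s) / t]]   # P = (A^T A)^(1/2)
    det_p = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    P_inv = [[P[1][1] / det_p, -P[0][1] / det_p], [-P[1][0] / det_p, P[0][0] / det_p]]
    U = matmul(A, P_inv)                        # U = A P^(-1)
    return U, P

U, P = polar_2x2([[2.0, 1.0], [0.0, 1.0]])
print(U)  # orthogonal factor
print(P)  # symmetric positive-definite factor
```

Multiplying U by P recovers A, and the columns of U are orthonormal, mirroring the scaling-and-rotation picture from the complex case.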


Brook Taylor (1685-1731) was an English mathematician who made significant contributions to several areas of mathematics, including calculus, algebra, and geometry. He is best known for his work on Taylor series, which provide a way to represent functions as infinite sums of polynomial terms.


DERIVATION OF THE
TAYLOR SERIES USING
VECTOR PROJECTION


Projection

The Taylor series can be seen as the orthogonal projection of a continuous function f in V onto a polynomial g in polynomial space P. Specifically, we want to minimize the quantity $\|f - g\|^2 = \langle f - g, f - g \rangle$, where the inner product is $\langle f, g \rangle = \int f(x)\,g(x)\,dx$.

The shortest distance between f and the polynomials in P is given by the orthogonal projection of f onto P, since $\|f - \mathrm{proj}_P f\|^2 \le \|f - \mathrm{proj}_P f\|^2 + \|\mathrm{proj}_P f - g\|^2 = \|f - g\|^2$ for any g in P by the Pythagorean theorem. To minimize the L2 loss $\|f - g\|^2$, the optimal value of g is given by $\mathrm{proj}_P f$. The projection of f onto the polynomial space P can be decomposed into:

$$g = \mathrm{proj}_P f = \langle b_1, f \rangle b_1 + \dots + \langle b_n, f \rangle b_n$$

with regard to an orthonormal basis set $(b_1, \dots, b_n)$.


Finding the Basis

The remaining task is to identify the orthonormal basis. Using the Gram-Schmidt orthonormalization process, given a basis set $(e_1, \dots, e_n)$, we can find the b’s as follows:

$$b_1 = e_1 / \|e_1\|.$$

$$b_2' = e_2 - \mathrm{proj}_{\mathrm{span}(b_1)} e_2; \qquad b_2 = b_2' / \|b_2'\|.$$

$$\vdots$$

$$b_n' = e_n - \mathrm{proj}_{\mathrm{span}(b_1)} e_n - \dots - \mathrm{proj}_{\mathrm{span}(b_{n-1})} e_n; \qquad b_n = b_n' / \|b_n'\|.$$

This process eliminates the components of the basis vector that have already been accounted for by previous basis vectors, and then normalizes them, where each $b_i'$ denotes an intermediate vector that has not yet been normalized.

The resulting expansion plays the role of the Taylor series, and it approaches the Taylor polynomial as the interval of integration shrinks around the expansion point.
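
The Gram-Schmidt step can be sketched for the monomial basis $(1, x, x^2)$. The code makes two assumptions that the article leaves open: the inner product is taken over the interval $[-1, 1]$, and polynomials are stored as coefficient lists (index i holds the coefficient of $x^i$). The function names are hypothetical.

```python
import math

def inner(p, q):
    # <p, q> = integral over [-1, 1] of p(x) q(x) dx for coefficient lists p and q.
    total = 0.0
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if (i + j) % 2 == 0:                 # odd powers integrate to zero
                total += pi * qj * 2.0 / (i + j + 1)
    return total

def gram_schmidt(basis):
    ortho = []
    for e in basis:
        v = list(e)
        for b in ortho:
            coeff = inner(b, e)                  # component of e along b
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(inner(v, v))
        ortho.append([vi / norm for vi in v])    # normalize the remainder
    return ortho

monomials = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 1, x, x^2
for b in gram_schmidt(monomials):
    print(b)
```

On this interval the output is the normalized Legendre polynomials, which is the orthonormal basis the projection formula needs.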


ALTERNATE DEFINITION
OF AFFINE SUBSETS


Definition 


An affine subset W of a vector space V is x+U, for x in V and U a subspace of V. An alternate definition of affine subsets is that a nonempty subset W is an affine subset if and only if for every v and w in W and every scalar $\lambda$, $(1-\lambda)v + \lambda w$ is in W.


Proof 


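If W is an affine subset, W = x+U for some x in V and subspace U. Let v and w be arbitrary vectors in W. Then, $v = x + u_i$ and $w = x + u_j$ for some $u_i, u_j$ in U. It follows that:

$$
\begin{aligned}
(1-\lambda)v + \lambda w &= (1-\lambda)(x + u_i) + \lambda(x + u_j) \\
&= x + u_i - \lambda x - \lambda u_i + \lambda x + \lambda u_j \\
&= x + u_i - \lambda u_i + \lambda u_j.
\end{aligned}
$$

By definition of a subspace, $(u_i - \lambda u_i + \lambda u_j)$ must be in U. Thus, $(1-\lambda)v + \lambda w = x + u_i - \lambda u_i + \lambda u_j$ must be in W.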

Conversely, suppose $(1-\lambda)v + \lambda w$ is in W for every v and w in W and every scalar $\lambda$. Then $v + \lambda(w - v)$ is in W, which implies $\lambda(w - v)$ is in the subset (W - v). On the other hand, since w is in W, (w - v) is in the subset (W - v). Hence, (W - v) is closed under scalar multiplication.

Let v’ and w’ also be vectors in W. Then, (w’ - v) and (v’ - v) are in the subset (W - v). Since $(1-\lambda)v' + \lambda w'$ is in W, taking $\lambda = 0.5$ gives us that 0.5(v’ + w’) is in W, or (0.5v’ + 0.5w’ - v) is in (W - v). By closure under scalar multiplication, (v’ + w’ - 2v) = (v’ - v) + (w’ - v) is in the subset (W - v). Therefore, for arbitrary vectors (w’ - v) and (v’ - v) in (W - v), (v’ - v) + (w’ - v) is also in (W - v). In other words, (W - v) is closed under vector addition.

Since (W - v) contains the zero vector (as v is in W) and is closed under scalar multiplication and vector addition, it is a subspace U of V. Therefore W = v + U is an affine subset of V.


An affine subset W of a vector space V is a translate x + U of a subspace U of V. It need not contain the origin but is parallel to U, as shown in the picture as the red plane parallel to the grey one.


VARIANCE & EXPLODING
OR VANISHING GRADIENT


Parameter Initialization 


In deep learning, when computing 
the gradients in the backward pass, some- 
times vanishing gradients or exploding 
gradients are encountered. Vanishing 
gradients are when the gradients become 
extremely small as the weights that are 
multiplied to the gradient signals get very 
small. In contrast, exploding gradients are 
when the gradients become extraordinari- 
ly large. If the parameters are initialized 
poorly, the model may have difficulty 
converging to an optimal solution during 
training. Hence, parameter initialization becomes especially important. In this article, we will explore how parameters in the weight matrices $\Omega_k$ at the k-th layer can be initialized to best prevent inefficient training (e.g., vanishing or exploding gradients) and poorly performing models with a ReLU activation function.


Variance of Standard ReLU 


The ReLU activation function follows:

$$X' = \mathrm{ReLU}(X) = \begin{cases} X, & \text{if } X > 0; \\ 0, & \text{if } X \le 0. \end{cases}$$

Suppose we initialize the bias vectors $\beta_k$ as zero vectors and the $\Omega_k$’s using a Gaussian with mean ($\mu$) 0. Then we would want the weights to be evenly spread out so that none of the weights are extremely large or small. Therefore, the task is to find the optimal variance of $\Omega_k$.
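Let $h_k$ be the k-th hidden layer, $f_k$ be $\beta_k + \Omega_k h_k$, and $D_k$ be the dimension of $h_k$. $h_k$ is naturally $\mathrm{ReLU}(f_{k-1}) = \mathrm{ReLU}(\beta_{k-1} + \Omega_{k-1} h_{k-1})$. It follows that the variance of the i-th element of $f_k$ is

$$
\begin{aligned}
\mathrm{Var}(f_{i,k}) &= E(f_{i,k}^2) - E(f_{i,k})^2 \\
&= E\Big(\Big[\beta_{i,k} + \sum_{j} \Omega_{i,j,k}\, h_{j,k}\Big]^2\Big),
\end{aligned}
$$

as $E(f_k) = E(\beta_k) + E(\Omega_k h_k) = 0 + E(\Omega_k)E(h_k) = 0$, and the summation over j from 1 to $D_k$ is derived from matrix multiplication. Thus,

$$
\begin{aligned}
\mathrm{Var}(f_{i,k}) &= E\Big(\beta_{i,k}^2 + 2\beta_{i,k}\sum_j \Omega_{i,j,k}\, h_{j,k} + \Big(\sum_j \Omega_{i,j,k}\, h_{j,k}\Big)^2\Big) \\
&= E\Big(\Big(\sum_j \Omega_{i,j,k}\, h_{j,k}\Big)^2\Big) \\
&= \sum_j E(\Omega_{i,j,k}^2)\,E(h_{j,k}^2),
\end{aligned}
$$

as $\beta_k$ is a zero vector, so $\beta_{i,k} = 0$, and the cross terms vanish because the weights are independent with zero mean. Since $E(\Omega_{i,j,k}^2) = \mathrm{Var}(\Omega_{i,j,k}) + E(\Omega_{i,j,k})^2 = \mathrm{Var}(\Omega_{i,j,k})$,

$$\sum_j E(\Omega_{i,j,k}^2)\,E(h_{j,k}^2) = \sum_j \mathrm{Var}(\Omega_{i,j,k})\,E(h_{j,k}^2) = D_k\,\mathrm{Var}(\Omega_k)\,E(h_k^2).$$

Since this holds for every element i, the variance of $f_k$ is $D_k\,\mathrm{Var}(\Omega_k)\,E(h_k^2)$.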
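Now, we will show that $E(h_k^2) = 0.5\,\mathrm{Var}(f_{k-1})$. First,

$$\mathrm{Var}(f_{k-1}) = E(f_{k-1}^2) - E(f_{k-1})^2 = E(f_{k-1}^2) - 0 = \int_{-\infty}^{\infty} f_{k-1}^2\, P(f_{k-1})\, d f_{k-1}.$$

Then, since $h_k = \max(0, f_{k-1})$ and the density $P(f_{k-1})$ is symmetric
about zero,

$$\begin{aligned}
E(h_k^2) &= \int_{-\infty}^{\infty} \max(0, f_{k-1})^2\, P(f_{k-1})\, d f_{k-1} \\
&= \int_{0}^{\infty} f_{k-1}^2\, P(f_{k-1})\, d f_{k-1} \\
&= \frac{1}{2} \int_{-\infty}^{\infty} f_{k-1}^2\, P(f_{k-1})\, d f_{k-1} = \frac{1}{2} \mathrm{Var}(f_{k-1}).
\end{aligned}$$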
Substituting the above result into $\mathrm{Var}(f_k)$, we get
$\mathrm{Var}(f_k) = 0.5\, D_k \mathrm{Var}(\Omega_k) \mathrm{Var}(f_{k-1})$.
Since our goal is to maintain the variance across layers (making
$\mathrm{Var}(f_{k-1})$ and $\mathrm{Var}(f_k)$ as close as possible), we should
set $\mathrm{Var}(\Omega_k)$ to $2/D_k$.
Similarly, in the backward pass, we calculate the variances in reverse order.
Thus, we would want $\mathrm{Var}(\Omega_k) = 2/D_{k+1}$.
The He initialization uses a variance of this form. When $D_k \ne D_{k+1}$, the
two conditions cannot both hold exactly, and a common compromise is to replace
the dimension by the average of the two. That is,

$$\mathrm{Var}(\Omega_k) = \frac{2}{0.5\,(D_k + D_{k+1})} = \frac{4}{D_k + D_{k+1}}.$$

The He initialization helps mitigate the issues of vanishing or exploding
gradients by keeping the variance of the activations and of the gradients
roughly constant from layer to layer. This, in turn, can lead to faster and more
stable training of deep neural networks and to better performance.
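
The effect of the variance choice on the forward pass can be checked numerically. The following is a minimal sketch, assuming square layers of a hypothetical width 256, zero biases, and Gaussian weights; the function name `forward_variances` and all sizes are illustrative choices, not part of the derivation above.

```python
import numpy as np

def forward_variances(scale, width=256, depth=20, batch=1024, seed=0):
    """Return Var(f_k) for k = 1..depth when Var(Omega_k) = scale / width and beta_k = 0."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))       # unit-variance inputs
    variances = []
    for _ in range(depth):
        # Zero-mean Gaussian weights with variance scale / width.
        omega = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        f = h @ omega.T                           # pre-activations f_k
        variances.append(float(f.var()))
        h = np.maximum(f, 0.0)                    # next hidden layer = ReLU(f_k)
    return np.array(variances)

if __name__ == "__main__":
    for scale in (1.0, 2.0, 4.0):
        v = forward_variances(scale)
        print(f"Var(Omega) = {scale:g}/D: first layer {v[0]:.3g}, last layer {v[-1]:.3g}")
```

Under these assumptions, a scale of 1 should roughly halve the variance at every layer, a scale of 4 should roughly double it, and a scale of 2 should keep it at about the same order of magnitude throughout.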




Kaiming He, Chinese computer scientist and AI researcher, and the inventor of
several influential algorithms, including ResNet and He initialization.




BIZZARE MATHEMATICS 


BAYESIAN LOSS IN
IMAGE RESTORATION


The Bayesian Framework


Image denoising is the process of removing noise from an image. Bayesian image
denoising is a popular method that uses a Bayesian framework to estimate a clean
image from a noisy observation.

In Bayesian image denoising, the goal is to find the posterior probability
distribution of the clean image given the noisy observation. This is given by
Bayes' theorem:

$$P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}}) = \frac{P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Im}_{\mathrm{clean}})\, P(\mathrm{Im}_{\mathrm{clean}})}{P(\mathrm{Im}_{\mathrm{noisy}})},$$

where $P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}})$ is the
posterior distribution of the clean image,
$P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Im}_{\mathrm{clean}})$ is the
likelihood of the noisy observation given the clean image,
$P(\mathrm{Im}_{\mathrm{clean}})$ is the prior distribution of the clean image,
and $P(\mathrm{Im}_{\mathrm{noisy}})$ is the evidence term that normalizes the
posterior distribution.

To estimate the clean image, a Bayesian loss function is used. The Bayesian loss
function is defined as the negative logarithm of the posterior distribution,
which measures the discrepancy between the estimated and the true clean image.

By minimizing the Bayesian loss function, we can find the estimate of the clean
image that is most consistent with the noisy observation and the prior
distribution.

One of the advantages of the Bayesian approach is that it allows us to
incorporate prior knowledge about the image into the denoising process, such as
the smoothness or sparsity of the image. This helps to reduce the impact of
noise on the image while preserving its important features.
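
As a small illustration of minimizing a negative log-posterior, consider a one-dimensional signal instead of an image. The sketch below assumes Gaussian noise for the likelihood and a Gaussian smoothness prior on the differences between neighbouring samples; both choices, the function name `map_denoise`, and the test signal are assumptions made for this example only. Under them the loss is quadratic and its minimizer solves a linear system.

```python
import numpy as np

def map_denoise(noisy, noise_var, prior_var):
    """Minimize ||y - x||^2 / (2 noise_var) + ||D x||^2 / (2 prior_var) over x.

    D is the first-difference operator, so the prior favours smooth signals.
    Setting the gradient to zero gives (I + lam * D^T D) x = y.
    """
    n = noisy.size
    d = np.diff(np.eye(n), axis=0)            # (n - 1) x n first-difference matrix
    lam = noise_var / prior_var               # relative weight of the prior
    return np.linalg.solve(np.eye(n) + lam * d.T @ d, noisy)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
clean = np.sin(2 * np.pi * t)                          # hypothetical clean signal
noisy = clean + rng.normal(scale=0.3, size=t.size)     # noisy observation
denoised = map_denoise(noisy, noise_var=0.3 ** 2, prior_var=0.03 ** 2)

print("MSE before:", np.mean((noisy - clean) ** 2))
print("MSE after: ", np.mean((denoised - clean) ** 2))
```

A larger `prior_var` trusts the observation more; a smaller one enforces more smoothness at the cost of flattening genuine features.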


Noise2Void (N2V)


The Noise2Void (N2V) algorithm (Krull et al.) was developed to overcome a
practical limitation of the Noise2Noise (N2N) algorithm (Lehtinen et al.), which
requires two independently noisy versions of the same clean image so that the
underlying clean image can be learned by predicting one from the other. N2V, on
the other hand, requires only one noisy image to recover the clean image.

The basic idea behind N2V is to train a deep neural network to predict the
original image pixels from the noisy input image pixels. The network is trained
using a Bayesian loss function that incorporates the properties of the noise
distribution. The N2V algorithm uses a U-Net network architecture with masked
blind spots, where the encoder part of the network maps the noisy image to a
low-dimensional feature space, and the decoder part of the network maps the
low-dimensional features back to the denoised image. Before passing the image
through the DNN, we first apply masks of certain shapes to the noisy image. The
masks create blind spots, which represent potentially noisy pixels that must be
restored from the distribution of the adjacent pixels.
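
The masking step can be sketched as follows. This is a simplified illustration rather than the published implementation: each selected pixel is overwritten with the value of a randomly chosen neighbour, and a plain squared error at the blind spots stands in for the Bayesian loss discussed in this article. The names `mask_blind_spots` and `blind_spot_loss` and the parameters `n_spots` and `radius` are hypothetical.

```python
import numpy as np

def mask_blind_spots(image, n_spots, radius=2, seed=0):
    """Return a masked copy of a 2-D image plus the blind-spot coordinates.

    Each blind spot is overwritten with the value of a different pixel taken
    from its (2 * radius + 1)-wide neighbourhood in the original image.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    masked = image.copy()
    rows = rng.integers(0, h, size=n_spots)
    cols = rng.integers(0, w, size=n_spots)
    for r, c in zip(rows, cols):
        while True:
            dr, dc = rng.integers(-radius, radius + 1, size=2)
            rr = min(max(r + dr, 0), h - 1)   # stay inside the image
            cc = min(max(c + dc, 0), w - 1)
            if rr != r or cc != c:            # never copy a pixel onto itself
                break
        masked[r, c] = image[rr, cc]
    return masked, rows, cols

def blind_spot_loss(prediction, noisy, rows, cols):
    """Mean squared error between prediction and noisy image at the blind spots only."""
    return float(np.mean((prediction[rows, cols] - noisy[rows, cols]) ** 2))
```

During training, the network would see `masked` as input and be scored only at `(rows, cols)`, so it cannot simply copy the noisy value of the pixel it is asked to predict.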
The N2V algorithm uses a Bayesian loss function, which involves defining a prior
distribution over the clean image space and then finding the maximum of the
posterior distribution given the noisy input image. Denote the environment
around a pixel, or its surrounding pixels, as $\mathrm{Env}_{\mathrm{pixel}}$.
We obtain the following expression directly from Bayes' rule:

$$P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}}) \propto P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Im}_{\mathrm{clean}})\, P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Env}_{\mathrm{pixel}}).$$

In the above equation, the posterior distribution of the clean image given the
noisy image is unknown. The likelihood of the noisy image given the clean image
is assumed to be Gaussian, with zero-mean noise and a variance that depends on
the local image structure. The prior distribution of the clean image given the
environment is the output of the U-Net that we intend to optimize.
Integrating over all potential clean images, we get the marginal distribution of
the noisy image given the environment:

$$P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Env}_{\mathrm{pixel}}) = \int P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Im}_{\mathrm{clean}})\, P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Env}_{\mathrm{pixel}})\, d\,\mathrm{Im}_{\mathrm{clean}},$$

where the distribution of the clean images given the environment is the output
of the U-Net and the distribution of the noisy image given the environment is
observed from the training data.

Hence, the likelihood function is also modeled as a Gaussian with a mean equal
to the noisy input image and a variance that is a function of the local
environment image structure. The training loss is the negative log-likelihood of
the integral. Minimizing it gives the maximum likelihood estimate of
$P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Env}_{\mathrm{pixel}})$, since such
an MLE implies the most probable value of the pixel given its context.

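During the testing phase, we bring in
$P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}})$ again:

$$P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}}) \propto P(\mathrm{Im}_{\mathrm{noisy}} \mid \mathrm{Im}_{\mathrm{clean}})\, P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Env}_{\mathrm{pixel}}).$$

As $P(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Env}_{\mathrm{pixel}})$ is
obtained from the neural network, we can compute
$E(\mathrm{Im}_{\mathrm{clean}} \mid \mathrm{Im}_{\mathrm{noisy}}, \mathrm{Env}_{\mathrm{pixel}})$,
the expected clean value under the posterior, which is the predicted clean
output.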
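
For a single pixel, the test-time computation can be sketched on a discrete grid of candidate clean intensities. The grid, the Gaussian-shaped prior standing in for the network output, and the noise level below are all assumptions chosen for illustration; `posterior_mean` is a hypothetical helper name.

```python
import numpy as np

def posterior_mean(noisy_value, candidates, prior, noise_std):
    """Expected clean value given a noisy pixel, a prior over candidates, and Gaussian noise."""
    # Likelihood of the noisy value for every candidate clean intensity.
    likelihood = np.exp(-0.5 * ((noisy_value - candidates) / noise_std) ** 2)
    posterior = likelihood * prior            # unnormalized posterior
    posterior = posterior / posterior.sum()   # normalize over the grid
    return float(np.sum(candidates * posterior))

candidates = np.linspace(0.0, 1.0, 101)       # candidate clean intensities
# Stand-in for the network output: the context suggests a value near 0.4.
prior = np.exp(-0.5 * ((candidates - 0.4) / 0.05) ** 2)
prior = prior / prior.sum()

estimate = posterior_mean(noisy_value=0.7, candidates=candidates,
                          prior=prior, noise_std=0.1)
print(estimate)
```

Because the prior here is sharper than the noise model, the estimate stays much closer to 0.4 than to the noisy value 0.7; for two continuous Gaussians with these parameters, the precision-weighted mean is 0.46.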

N2V does not require any explicit ground truth images and can handle different
types of noise, including Gaussian, Poisson, and speckle noise.
COSINE SIMILARITY LOSS IN NLP

[Sidebar, partly illegible: "... about this magazine ... at sophia.yx.zhu@gmail.com."]


Try This


Question: Prove that if $\theta$ is the angle between $a$ and $b$, then
$\cos(\theta)$ is equal to

$$\frac{\langle a, b \rangle}{\|a\|\,\|b\|}$$

using the cosine rule.


Text Embedding


In natural language processing (NLP), a vector is a mathematical representation
of a text document that has each of its words or phrases encoded as a numerical
value in a high-dimensional space. This process is called text embedding.


Geometric Interpretation of the Loss


Cosine similarity measures the similarity between two vectors and is commonly
used in natural language processing (NLP). The cosine similarity loss decreases
as the similarity increases, so by minimizing the loss, we aim to make two
embedded texts represented as vectors as similar as possible.

The cosine similarity ranges from -1 to 1, with a value of 1 indicating that the
vectors point in the same direction, and a value of -1 indicating that they
point in opposite directions. The formula of cosine similarity for two vectors
$a$ and $b$ follows:

$$\cos(\theta) = \frac{\langle a, b \rangle}{\|a\|\,\|b\|}.$$

[Figure: Text A and Text B embedded as vectors; labels include Vector A, Cosine Distance, and Cosine Similarity Loss.]
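
The formula translates directly into code. The sketch below takes the loss to be one minus the cosine similarity, which is a common convention but an assumption here, since the article does not spell out the exact form of the loss. The embedding vectors are made-up numbers.

```python
import numpy as np

def cosine_similarity(a, b):
    """<a, b> / (||a|| * ||b||) for two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_similarity_loss(a, b):
    """Assumed convention: 0 when the vectors are aligned, 2 when they are opposite."""
    return 1.0 - cosine_similarity(a, b)

text_a = [0.2, 0.7, 0.1]   # hypothetical embedding of Text A
text_b = [0.4, 1.4, 0.2]   # same direction as text_a, twice the length
print(cosine_similarity(text_a, text_b), cosine_similarity_loss(text_a, text_b))
```

Only the direction of the vectors matters: scaling either embedding by a positive constant leaves the similarity unchanged.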
REDUCING ERRORS IN
LEAST SQUARES: VARIANCES,
BIASES, & NOISES


Dividing Least Squares 


The least squares loss between a neural network $f$ evaluated at input $x$ and
the sample output $y(x)$ is $L = (f(x) - y(x))^2$. By adding and subtracting the
term $\mu(y)$, the mean of $y(x)$, we get

$$\begin{aligned}
L &= \big([f(x) - \mu(y)] + [\mu(y) - y(x)]\big)^2 \\
&= (f(x) - \mu(y))^2 + 2\,(f(x) - \mu(y))(\mu(y) - y(x)) + (\mu(y) - y(x))^2.
\end{aligned}$$

Taking the expected value of both sides with respect to $y$ gives

$$\begin{aligned}
E_y(L) &= E_y\big((f(x) - \mu(y))^2\big) + 2\,E_y\big(f(x) - \mu(y)\big)\,E_y\big(\mu(y) - y(x)\big) + E_y\big((\mu(y) - y(x))^2\big) \\
&= [f(x) - \mu(y)]^2 + 2\,[f(x) - \mu(y)] \cdot 0 + \sigma_y^2 \\
&= [f(x) - \mu(y)]^2 + \sigma_y^2.
\end{aligned}$$

Thus, the expected loss can be separated into two parts: the squared deviation
between the model output and the mean of $y$, and the variance of the sample
training outputs. The additional variance in the sample labels is unavoidable.
The term $\sigma_y^2$ is called the noise, a type of irreducible error that
results from the variance of the training data itself.

Since our neural network depends on the sample training dataset $i$, we denote
$f$ as $f_i$ and write $\mu(f) = E_i(f_i(x))$ for the mean model output over
training datasets. The term $[f_i(x) - \mu(y)]^2$ in $E_y(L)$ can further be
divided into two other types of errors: variance and bias.

$$\begin{aligned}
[f_i(x) - \mu(y)]^2 &= \big[(f_i(x) - \mu(f)) + (\mu(f) - \mu(y))\big]^2 \\
&= [f_i(x) - \mu(f)]^2 + 2\,[f_i(x) - \mu(f)]\,[\mu(f) - \mu(y)] + [\mu(f) - \mu(y)]^2.
\end{aligned}$$

Taking the expected value with respect to $i$ gives

$$\begin{aligned}
E_i\big([f_i(x) - \mu(y)]^2\big) &= E_i\big([f_i(x) - \mu(f)]^2\big) + 2 \cdot 0 \cdot [\mu(f) - \mu(y)] + [\mu(f) - \mu(y)]^2 \\
&= E_i\big([f_i(x) - \mu(f)]^2\big) + [\mu(f) - \mu(y)]^2.
\end{aligned}$$

Thus, the expected loss over training datasets $i$ can be divided into the
variance of the neural network $f_i$ trained on $i$, the bias term defined by
the squared deviation between the mean of the neural network output and the
mean of $y$, and the noise. The bias is also known as the systematic error of
the model, as it is the deviation between the average predicted value and the
average true output value. In other words,

$$E_i\big(E_y([f_i(x) - y(x)]^2)\big) = E_i\big([f_i(x) - \mu(f)]^2\big) + [\mu(f) - \mu(y)]^2 + \sigma_y^2,$$

with the first term representing variance, the second bias, and the third noise.
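
The decomposition can be checked by simulation. In the sketch below a straight-line fit stands in for the neural network, the true mean is assumed to be sin(x), and the noise level, the input x0, and the number of training sets are arbitrary choices for illustration. Each training set i yields one model f_i, and the three terms are estimated from the spread of the predictions at x0.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_std = 0.3                  # sigma_y, assumed label noise
x0 = 1.0                         # input at which the loss is evaluated
true_mean = np.sin(x0)           # mu(y) at x0 under the assumed data model

def train_and_predict(n_points=20, degree=1):
    """Draw a training set i, fit a polynomial f_i, and return f_i(x0)."""
    x = rng.uniform(0.0, np.pi, size=n_points)
    y = np.sin(x) + rng.normal(scale=noise_std, size=n_points)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

predictions = np.array([train_and_predict() for _ in range(5000)])
test_y = true_mean + rng.normal(scale=noise_std, size=predictions.size)

expected_loss = np.mean((predictions - test_y) ** 2)            # E_i E_y of the loss
variance = np.mean((predictions - predictions.mean()) ** 2)     # spread of f_i around mu(f)
bias_sq = (predictions.mean() - true_mean) ** 2                 # (mu(f) - mu(y))^2
noise = noise_std ** 2                                          # sigma_y^2

print(expected_loss, variance + bias_sq + noise)
```

The two printed numbers should agree up to Monte Carlo error. Raising `degree` in this toy setup tends to lower the bias term and raise the variance term, while the noise term stays fixed.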
Minimizing Errors 


Due to the natural variability of the sample training dataset, the noise
component of least squares loss (L2 loss) is unavoidable. However, we can reduce
the variance and bias, which depend on the neural network model.

To reduce the variance, we can gather more sample data to be more certain about
the output for a given input x; as the number of data points approaches
infinity, the predicted output converges to a single point estimate that no
longer depends on the particular training sample.

To reduce the bias, we can increase the capacity of our model by adding more
hidden layers or hidden units in each layer, which results in more linear
regions that help the model better approximate a curved surface.

However, note that there is typically a trade-off between reducing the variance
and reducing the bias. If we increase the capacity of the neural network, the
network might overfit the training dataset and be overly affected by its noise,
which does not generalize to the true distribution of the data. Thus, the
capacity of the neural network should be chosen carefully to avoid such
problems.